Authorship analysis using function words forensic linguistics

All the Text’s a Stage; And All the Function Words Merely Players?

Statistical Analysis of Authorship

Vlad Mackevic

Aston University

Work of a Modern Forensic Linguist

Playing detective?

In forensic science – investigators look for clues that the culprit leaves unwittingly;

In linguistics – ‘unconscious language’

i.e. Function Words (de Vel, 2001; Argamon & Levitan, 2005; Burrows, 2003)

A rather old idea (Mosteller & Wallace, 1964); revisited in Holmes & Forsyth (1995).

Advantages of Function Words in FL

‘Unconscious language’

Numerous even in a relatively short text.

Can be easily counted

Related to the Daubert Criteria

Enables corpus analysis (Key Words in Context)
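The Key Words in Context view mentioned in the last point can be sketched in a few lines of Python. This is only an illustration of the idea, not the concordance software used in the study; the function name kwic, the width parameter and the simple regex tokeniser are all assumptions made for the example.

```python
import re


def kwic(text, keyword, width=4):
    """Return (left context, hit, right context) for every occurrence of keyword."""
    # Naive tokeniser (an assumption): runs of letters and apostrophes are words.
    tokens = re.findall(r"[A-Za-z']+", text)
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


# Example sentence taken from the IT examples in these slides.
sample = "We woke up early to catch the ferry and it couldn't have been easier."
for left, word, right in kwic(sample, "it"):
    print(f"{left:>30} [{word}] {right}")
```

Each hit is shown with a few words on either side, which is what lets the analyst decide which role the function word plays in that sentence.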

The Daubert Criteria

1. The theory must have been tested;

2. It must have been subjected to peer review and publication;

3. It must have a known error rate;

4. It must be generally accepted in the scientific community.

(Tiersma & Solan, 2002, cited in Coulthard, 2004; Chaski, 1997; Grant, 2007)

Implications for linguists

Increased pressure on linguists to use mathematical methods and repeatable procedures;

Forensic linguists must serve justice;

‘Beyond reasonable doubt’ in criminal cases (Grant, 2010)

‘Raise legitimate doubt’ in civil cases (ibid.)

The method is King, not the expert.

It is ‘a challenge to the academic community to test the error rate and at the same time to fix an acceptable statistical equivalent for “beyond reasonable doubt”’

Coulthard (2004: 476)

It is ‘the linguist’s responsibility to create theoretically sound hypotheses’ and test them

Chaski (2001: 2).

Idiolect

Defined as the idiosyncratic use of dialect, idiolect is a way of speaking (and, consequently, writing) that is unique to each individual

Chaski (1997).

‘the totality of the possible utterances of one speaker at one time in using a language to interact with one other speaker’

Bloch (1948, cited in Grieve, 2007: 255).

Theory

Grant (2010) - two theoretical frameworks:

Idiolect is linked to neuroscience

The author is influenced by the language he/she is exposed to.

De Vel’s (2001) and Argamon & Levitan’s (2005) claims about certain function words being unconscious linguistic choices – also a theory.

Theory (cont.)

Grant (2010):

‘simple detection of consistency and determination of distinctiveness’ would be able to help practical authorship analysis more than even a strong theory.

Hypotheses

The use of function words is unique to each individual (could be limited by context or genre) - idiolect;

The frequency of certain function words is an authorship marker (e.g. Holmes & Forsyth, 1995);

The frequency of semantic roles that certain function words play is also an authorship marker.

Semantic Roles

Semantic roles are the word’s functions in the specific context of the sentence.

The words I analysed were AS, IT, THAT and THERE

Criteria: frequency (corpus) and explicit multiple meanings
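The frequency criterion can be illustrated with a short Python sketch that counts the four chosen words and normalises the counts per 1,000 words. The function names, the tokeniser, the decision to count a contraction such as "it's" as an occurrence of IT, and the per-1,000 normalisation are assumptions for illustration; the slides do not say how the counts were normalised.

```python
import re
from collections import Counter

MARKERS = ("as", "it", "that", "there")


def tokenize(text):
    """Lower-case word tokens; a contraction such as "it's" stays one token."""
    text = text.lower().replace("\u2019", "'")  # treat curly apostrophes as straight
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text)


def marker_counts(text, markers=MARKERS):
    """Raw count of each marker word in the text."""
    counts = Counter()
    for tok in tokenize(text):
        base = tok.split("'")[0]  # assumption: "it's" counts as "it"
        if base in markers:
            counts[base] += 1
    return {m: counts[m] for m in markers}


def marker_rates(text, markers=MARKERS, per=1000):
    """Marker frequency per `per` words, so texts of different length compare."""
    total = len(tokenize(text))
    counts = marker_counts(text, markers)
    return {m: (counts[m] / total * per if total else 0.0) for m in markers}


# Made-up sample sentence, loosely built from examples in these slides.
sample = "It's time for Pendulum. As soon as we made it there, that night was over."
print(marker_counts(sample))
print(marker_rates(sample))
```

A count like this only gives the surface frequency of each word; assigning every hit to one of the semantic roles listed on the following slides still requires reading the hit in context.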

AS

Function | Examples
Start of time adjunct clause | As we approached the small hut; as I followed the masses
Fixed phrase as [adj/adv] as | As easily as; as soon as; as well as
AS + noun phrase | as a museum; as the red-light district
AS at the start of a manner adjunct | as you can imagine; as the locals do
AS could be replaced with because | big push for the Chinese people to learn English, as they have now made it mandatory in their schools
AS is used for comparison | as if they knew we were on their turf; still as a board; the same as fall back in Chicago

IT

Function | Examples
IT serves as a dummy subject: |
IT + [to be] + predicative + infinitive | It's hard to enjoy a festival the same way
IT + [to be] or other verb phrase (+ adj/noun phrase) + relative clause (that, if etc.) | It turns out I'll be going to at least four
IT + [to be] + time reference | it's time for Pendulum
IT + seem/feel/any other perception verb | it stops feeling like Hannover
IT + [to be] + noun phrase | it would have been a great day
IT refers to something mentioned before | We woke up early to catch the ferry and it couldn't have been easier.
IT is a part of a fixed phrase | We made it to Macau in less than 2 hours

THAT

Function | Examples
THAT begins a subordinate clause | I also couldn't help but notice that when I looked toward the island
THAT could be replaced with which | It was the spot on the beach that was shaped like a triangle
THAT is a determiner | That night, we all reconvened at the hotel

THERE

Function | Examples
THERE serves as a dummy subject | there are a few longhaired dogs
THERE refers to a place | it was there strictly for the tourists

My Dataset

Attribute | Author A | Author B
Type of text | Travel Blog | Travel Blog
Gender (self-declared) | Female | Male
Mother tongue and variety (self-declared) | English (American) | English (perhaps Irish)
Website URL - the data source | http://www.travelblog.org | http://www.getjealous.com

Size of K corpus (words)

Author | 9 texts | 7 texts | 5 texts | 3 texts | Q text
Author A | 20,875 | 16,118 | 11,024 | 6,260 | 2,479
Author B | 7,991 | 6,176 | 4,241 | 2,611 | 750

Methodology

Texts were imported into TEXTSTAT concordance software;

Words AS, IT, THAT and THERE were chosen for their explicit diverse meanings in the sentence;

Quantitative analysis was used to determine how different (or similar) the authors were in terms of their frequency of use of function words and their meanings;

The number of texts was reduced to see if at some point analysis breaks down (compare to Grant, 2007);

Statistical technique used – T-TEST
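The slides name the T-test but do not show the calculation. The sketch below assumes a one-sample t-test (the kind offered by the T-test calculator listed under Websites), in which one marker's frequency in each known (K) text is tested against its frequency in the questioned (Q) text, with the decision read off a t-table. The function names and the example rates are hypothetical, and how a test result is turned into a PSA percentage is not stated in the slides, so the sketch stops at the significance decision.

```python
import math
import statistics

# Two-tailed 5% critical values of Student's t, as printed in standard t-tables,
# for the degrees of freedom that 3, 5, 7 and 9 known texts give (n - 1).
T_CRIT_5PCT = {2: 4.303, 4: 2.776, 6: 2.447, 8: 2.306}


def one_sample_t(k_rates, q_rate):
    """Return (t, df) for known-text rates tested against one questioned-text rate."""
    n = len(k_rates)
    if n < 2:
        raise ValueError("need at least two known texts")
    sd = statistics.stdev(k_rates)  # sample standard deviation
    if sd == 0:
        raise ValueError("rates are identical; t is undefined")
    t = (statistics.mean(k_rates) - q_rate) / (sd / math.sqrt(n))
    return t, n - 1


def differs_at_5pct(k_rates, q_rate):
    """True if the Q rate differs significantly from the K mean at the 5% level."""
    t, df = one_sample_t(k_rates, q_rate)
    return abs(t) > T_CRIT_5PCT[df]


# Hypothetical rates per 1,000 words for one marker in five known texts.
k_rates = [4.0, 5.0, 6.0, 5.0, 5.0]
print(one_sample_t(k_rates, 5.5))
print(differs_at_5pct(k_rates, 5.5), differs_at_5pct(k_rates, 8.0))
```

With only a handful of known texts the degrees of freedom are small and the critical values large, which is one reason the analysis can be expected to weaken as texts are removed from the corpus.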

Matrix of Probabilities

Application | PSA values | Meaning
Clustering | PSA > 90% | Success
Clustering and Differentiating | PSA ≥ 95% | ‘Beyond Reasonable Doubt’
Differentiating | PSA < 85% | Definite Failure (error rate at 15% causes reasonable doubt).
Clustering and Differentiating | PSA > 50% | Balance of probabilities – suitable for civil court.

PSA = probability of same authorship

Clustering = the author of both texts is likely to be the same person

Differentiating = texts were written by different authors

Beyond reasonable doubt: 95%

Findings: T-Test

Clustering

Analysing each marker of the same author against the values of that marker in the Q text by the same author

How likely is that person to have produced the text?

Discriminating

Analysing each marker of one author against the values of that marker in the Q text by the other author

How likely is it that the K (known) and Q (questioned) texts were produced by the same person?

Findings: Reliability of markers

All texts by one author compared against each other

Every semantic role of each function word was included

Special attention: success of the test depends on the amount of text

Not all markers are reliable; their frequency can be too low in a short text

Marker | Clustering | Discrimination
AS | Very inconsistent | Consistent
IT | Very consistent | Very consistent
THAT | Depends on the amount of text (A - yes; B - no) | Depends on the amount of text (A - yes; B - no)
THERE | Very consistent | Very consistent

T-Test: Success

BRD = Beyond Reasonable Doubt: 95% or more

Function Word | Function | Clustering, Discriminating (A, B)
AS | Start of time adjunct clause | FAIL YES BRD NO BRD
AS | Fixed phrase as [adj/adv] as | BRD FAIL FAIL YES BRD
AS | AS + noun phrase | FAIL BRD YES YES NO
AS | AS at the start of a manner adjunct | FAIL YES BRD N/A NO
AS | AS could be replaced with because | BRD BRD N/A N/A N/A
AS | AS is used for comparison | YES BRD BRD FAIL NO
IT | Dummy subject | YES YES BRD FAIL BRD
IT | Dummy subject at the start of the sentence | FAIL FAIL FAIL FAIL NO
THAT | That begins a subordinate clause | BRD YES FAIL FAIL NO
THAT | That could be replaced with which | FAIL FAIL BRD BRD BRD
THAT | That is a determiner | FAIL FAIL FAIL YES BRD
THERE | Dummy subject | YES BRD N/A FAIL NO
THERE | Dummy subject at the start of the sentence | FAIL FAIL N/A FAIL BRD

Results

Marker | Success | Failure | Explanation
AS | 50% | 33.33% | A fairly reliable marker. Would do in civil court.
IT | 80% | 20% | The most reliable marker in this study. IT at the start of the sentence has no linguistic theory behind it, and failure was expected.
THAT | 46.67% | 53.33% | Also in Mackevic (2011): “Very unreliable across all authors – enormous error rates; PSA shooting over 50% most of the time.”
THERE | 30% | 50% | Marker totally unreliable.
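The percentages for THAT and THERE can be reproduced from the outcome cells of the T-Test: Success table under one reading of that table, sketched below. Treating YES and BRD as successes, FAIL and NO as failures, and keeping N/A cells in the denominator only is an assumption made here, not something the slides state; the function name tally is likewise hypothetical.

```python
SUCCESS = {"YES", "BRD"}
FAILURE = {"FAIL", "NO"}


def tally(rows):
    """Return (success %, failure %) over all outcome cells, rounded to 2 places."""
    cells = [cell for row in rows for cell in row]
    n = len(cells)  # N/A cells stay in the denominator (an assumption)
    success = sum(cell in SUCCESS for cell in cells)
    failure = sum(cell in FAILURE for cell in cells)
    return round(100 * success / n, 2), round(100 * failure / n, 2)


# Outcome cells copied from the T-Test: Success table.
THAT = [
    ["BRD", "YES", "FAIL", "FAIL", "NO"],
    ["FAIL", "FAIL", "BRD", "BRD", "BRD"],
    ["FAIL", "FAIL", "FAIL", "YES", "BRD"],
]
THERE = [
    ["YES", "BRD", "N/A", "FAIL", "NO"],
    ["FAIL", "FAIL", "N/A", "FAIL", "BRD"],
]

print("THAT", tally(THAT))
print("THERE", tally(THERE))
```

Under this reading THAT comes out at 46.67% success and 53.33% failure, and THERE at 30% and 50%, matching the reported figures.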

Discussion of Results

Most of the markers – much better at discriminating than at clustering;

A lot depends on the text’s length: when I started removing texts from the corpus (9, then 7, then 5 and finally 3), analysis began breaking down;

6000 words for the reference corpus – approximate benchmark.

Possible conclusion: function words are really better for longer texts, which also occur in forensic settings.

Why did the T-test fail?

Possible explanation: some markers occurred very rarely

They had little linguistic significance (no theory behind them)

Analysis broke down with very consistent markers. Why?

Possibly because the amount of text (number of words) was insufficient

For comparison: Grant (2010) also reports his analysis breaking down when the amount of text is reduced

Perhaps qualitative analysis is better for shorter texts

But it works against the Daubert Criteria

Recommendations

Use grammar reference books for semantic roles of function words and more detailed division of roles

Choose different words (look at what worked for other authors)

Try more texts, but short ones (e.g. 50 texts of 400 words each)

Try more statistical techniques

Conclusion

Function words – potentially another tool in a forensic linguist’s toolbox

T-Test – good analytical tool;

It returns exact results with certain error rates that are easy to interpret (consistent with Daubert criteria)

However, it also has some limitations and additional analysis may be needed to complete the picture

T-Test works with discriminating better than with clustering

Analysis breaks down with small corpora

References

NB: The references are from the original paper; some authors present in this list may not have been cited in the presentation

Books and Journals

Argamon, S. & Levitan, S. (2005) Measuring the Usefulness of Function Words for Authorship Attribution [Online]. Available at: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.71.6935&rep=rep1&type=pdf [Accessed 12 September 2010]

Burrows, J. (2003). Questions of Authorship: Attribution and Beyond. Computers and the Humanities [Online] 37, pp. 5-23. Available from: http://www.springerlink.com/content/nv46t75125472350/ [Accessed 1 August 2010].

Chaski, C. E. (1997). Who Wrote It? Steps Towards a Science of Authorship Identification. National Institute of Justice Journal. (September Issue) [Online]. Available from: http://www.ncjrs.gov/pdffiles/jr000233.pdf [Accessed 31 January 2010].

Chaski, C. E. (2001). Empirical evaluations of language-based author identification techniques. The International Journal of Speech, Language and the Law [Online] 8 (1), pp. 1-65. Available from: http://www.equinoxjournals.com/ojs/index.php/IJSLL/article/view/1690/1151 [Accessed 12 June 2008].

Chaski, C. E. (2005). Who’s at the Keyboard? Authorship Attribution in Digital Evidence Investigations. International Journal of Digital Evidence [Online] 4 (1), pp. 1-14. Available from: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.75.3852&rep=rep1&type=pdf [Accessed 31 January 2010].


Coulthard, M. (1998). Identifying the Author. Cahiers de Linguistique Française [Online] 20, pp. 139-161. Available at: http://clf.unige.ch/display.php?idFichier=168 [Accessed 28 January 2010].

Coulthard, M. (2004). Author Identification, Idiolect and Linguistic Uniqueness. Applied Linguistics [Online] 25 (4), pp. 431-447. Available at: http://www.business-english.ch/downloads/Malcolm%20Coulthard/AppLing.art.final.pdf [Accessed 27 January 2010].

Coulthard, M. & Johnson, A. (2007). An Introduction to Forensic Linguistics: Language in Evidence. Abingdon: Routledge.

De Vel, O. (2001). Multi-Topic E-mail Authorship Attribution Forensics. In: ACM Conference on Computer Security – Workshop on data mining for security applications. November 8, 2001. Philadelphia, PA [Online]. Available at: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.19.9951&rep=rep1&type=pdf [Accessed 31 August 2010].

Grant, T. (2007). Quantifying Evidence in Forensic Authorship Analysis. The International Journal of Speech, Language and the Law [Online] 14 (1), pp. 1-25. Available at: http://www.equinoxjournals.com/IJSLL/article/view/3955/2428 [Accessed 12 July 2010].

Grant, T. (2008). Dr Tim Grant: How text-messaging slips can help catch murderers. The Independent [Online]. (Last updated 9 September 2009). Available at: http://www.independent.co.uk/opinion/commentators/dr-tim-grant-how-textmessaging-slips-can-help-catch-murderers-923503.html [Accessed 11 September 2010].

Grant, T. D. (2010). Txt 4n6: idiolect free authorship analysis? In: Routledge Handbook of Forensic Linguistics. Abingdon: Routledge.


Grant, T. & Baker, K. (2001). Identifying reliable, valid markers of authorship: a response to Chaski. The International Journal of Speech, Language and the Law [Online] 8 (1), pp. 66-79. Available at: http://www.equinoxjournals.com/ojs/index.php/IJSLL/article/view/1691/1150 [Accessed 12 June 2008].

Holmes, D. I. & Forsyth, R. S. (1995). The Federalist Revisited: New Directions in Authorship Attribution. Literary and Linguistic Computing [Online] 10 (2), pp. 111-127. Available from: http://llc.oxfordjournals.org/cgi/reprint/10/2/111 [Accessed 1 August 2010] .

Hunston, S. (2002). Corpora in Applied Linguistics. Cambridge: Cambridge University Press.

Mitchell, E. (2008). The Case for Forensic Linguistics. BBC News [Online]. (Last updated 8 September 2008). Available at: http://news.bbc.co.uk/1/hi/sci/tech/7600769.stm [Accessed 11 September 2010].


Rudman, J. (1998). The State of Authorship Attribution Studies: Some Problems and Solutions. Computers and the Humanities [Online] 31, pp. 351–365. Available from: http://www.springerlink.com/content/l023q7047388133x/fulltext.pdf [Accessed 2 August 2010].

Websites:

Textstat

http://neon.niederlandistik.fu-berlin.de/textstat/

T-test Calculator

http://www.graphpad.com/quickcalcs/OneSampleT1.cfm

T-Tables

http://www.statsoft.com/textbook/distribution-tables/#t
