Text Mining Conference Programme 2014 Final


  • 4th LSE Conference on Text Mining Methods: Text-Mining and Public Policy Analysis

    London School of Economics and Political Science

    Friday 27th June 2014

    Room: AGWR

    (Graham Wallas Room, OLD5.25)

    Contact: Dr. Aude Bicquelet: [email protected]

    The 4th LSE Text-Mining Conference will be an opportunity for methodologists, political scientists and data miners to come together to discuss applications and experiences in using text-mining methods for public policy analysis. The conference will examine the application of digital technologies to support social and policy research within three methodological strands: (1) quantitative text analysis, (2) exploratory content analysis and (3) qualitative content analysis. Presentations will focus on innovation in methodology and on applications of text-mining techniques to analyse and inform public policy decisions.


    Conference Programme

    Friday 27th June

    Welcome and Introductions

    Aude Bicquelet (9:45 - 10:00)

    London School of Economics, Department of Methodology

    Speakers:

    Paul Nulty & Monica Poletti (10:00-10:40)

    [email protected]; [email protected]

    LSE (Department of Methodology); LSE & University of Milan

    Title: The Immigration Issue in the European Electoral Campaign in the UK: Text-Mining Public Debate from Newspapers and Social Media

    Summary: In recent years, the issue of immigration has become increasingly salient in

    the UK political and media debate. Moreover, with the development and persistence of

    the economic and financial crisis within the EU, immigration has been linked to growing

    opposition and criticism towards the European Union. In a country in which

    Euroscepticism has historically been high compared to countries in continental Europe,

    EU immigration-related statements connected to EU free-border agreements became

    more widespread. For this reason, we expect immigration to be a prominent issue in the media coverage of the electoral campaign for the 2014 European Parliament elections.

    By covering (potential) EU immigrants and EU immigration issues in a certain way, the media tend to promote or restrain certain ideas of immigration, which might eventually affect the public's views. In fact, we know from previous studies that immigration,

    particularly in times of economic crisis, is a challenge for society that can be framed not

    only in positive or negative terms, but also in economic or cultural terms.

    By looking at the news coverage of several newspapers, selected according to their

    political orientations and including both broadsheets and tabloids, this study first

    considers the salience of coverage of EU immigrants and EU immigration issues in UK

    newspapers in the three months preceding the EU elections of May 2014. It further

    explores whether the news coverage of different newspapers is framed in economic (e.g. jobs) or cultural (e.g. identity) terms. In addition, we mine information from social media sources (Twitter and Wikipedia) to discover how the immigration

    debate is framed by politically engaged members of the public on these platforms.

    Although inferring public opinion from social media can be problematic, it is a

    rewarding approach given the volume of public text available, and the ability to identify

    political affiliations through network connections. Understanding the representation of immigration and its connections with public opinion is crucial not only to contribute to the scholarly debate on anti-immigration and Eurosceptic attitudes, but also to better inform public policy decisions.

    In addition to these substantive contributions, this study employs advanced text-mining

    methods for data collection and data analysis. Text from politically engaged Twitter

    users is collected using a novel iterative keyword-learning approach that allows high

    precision and recall when the aim is to discover only tweets relevant to a particular

    issue. This text, and the text from newspaper articles, is analysed using topic modelling

    to investigate the salience of sub-issues in the immigration debate. In addition, weighted

    log-odds ratios are used to measure linguistic differences across newspapers and social

    media users from different sides of the ideological spectrum in the UK.
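
    The abstract names weighted log-odds ratios as the measure of linguistic difference but gives no implementation detail. Below is a minimal sketch of one common formulation (a log-odds ratio with an informative Dirichlet prior and variance standardisation, in the spirit of Monroe, Colaresi and Quinn); the group names, word counts and prior_scale value are invented placeholders, not the authors' data or settings.

        from collections import Counter
        import math

        def weighted_log_odds(counts_a, counts_b, prior_scale=0.01):
            """Weighted log-odds of word use in corpus A relative to corpus B,
            with an informative Dirichlet prior taken from the pooled corpus
            and standardised by the estimated variance."""
            vocab = set(counts_a) | set(counts_b)
            prior = {w: prior_scale * (counts_a.get(w, 0) + counts_b.get(w, 0)) for w in vocab}
            n_a = sum(counts_a.values()) + sum(prior.values())
            n_b = sum(counts_b.values()) + sum(prior.values())
            scores = {}
            for w in vocab:
                y_a = counts_a.get(w, 0) + prior[w]
                y_b = counts_b.get(w, 0) + prior[w]
                delta = math.log(y_a / (n_a - y_a)) - math.log(y_b / (n_b - y_b))
                scores[w] = delta / math.sqrt(1.0 / y_a + 1.0 / y_b)
            return scores

        # Hypothetical word counts standing in for two groups of outlets or users.
        group_a = Counter("immigration workers jobs jobs welfare rights".split())
        group_b = Counter("immigration border border control benefits sovereignty".split())
        for word, z in sorted(weighted_log_odds(group_a, group_b).items(), key=lambda kv: -kv[1]):
            print(f"{word:12s} {z:+.2f}")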

    Alexander Herzog and Kenneth Benoit (10:40-11:20)

    [email protected]; [email protected]

    LSE

    Title: The Most Unkindest Cuts: Government Cohesion and Economic Crisis

    Summary: Economic crisis and the resulting need for austerity budgets have divided many governing parties in Europe, despite the strict party discipline exercised over the

    legislative votes to approve these harsh budgets. Our analysis attempts to measure

    divisions in governing coalitions by applying automated text analysis methods to scale

    the positions that MPs express in budget debates. Our test case is Ireland, a country that

    has experienced both periods of rapid economic growth and a deep financial and economic crisis. Our analysis includes all annual budget debates during the time

    period from 1983 to 2013. We demonstrate that government cohesion as expressed

    through legislative speeches has significantly decreased as the economic crisis

    deepened, the result of government backbenchers speaking against the painful austerity budgets introduced by their own governments. While ministers are bound by the doctrine of collective cabinet responsibility and hence always vote for the finance minister's budget proposal, we find that party backbenchers' position-taking is systematically related to the economic vulnerability of their constituencies and to the safety of their electoral margins.
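
    The abstract does not name the scaling model (Poisson scaling models of the 'Wordfish' family are a common choice for budget speeches). Purely as an illustration of one-dimensional scaling of speech texts, and not as the authors' method, the sketch below scores a few invented speech snippets on the first dimension of a correspondence analysis of the document-term matrix.

        import numpy as np
        from sklearn.feature_extraction.text import CountVectorizer

        # Hypothetical snippets standing in for full MP budget speeches.
        speeches = {
            "Minister A":    "the budget secures stability growth and fiscal responsibility",
            "Backbencher B": "these cuts hurt families in my constituency and deepen unemployment",
            "Backbencher C": "painful cuts are regrettable but necessary for fiscal stability",
        }

        X = CountVectorizer().fit_transform(speeches.values()).toarray().astype(float)
        P = X / X.sum()                                     # relative frequencies
        r, c = P.sum(axis=1), P.sum(axis=0)                 # row and column masses
        S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardised residuals
        U, s, Vt = np.linalg.svd(S, full_matrices=False)
        positions = (U[:, 0] * s[0]) / np.sqrt(r)           # first-dimension row scores

        for name, score in zip(speeches, positions):
            print(f"{name:14s} {score:+.3f}")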

    Coffee Break (11:20-11:30)

    Peter John et al.* (11:30 - 12:10)

    [email protected]

    UCL, School of Public Policy

    * Gurciullo, S., Herzog, A., John, P., Mikhaylov, S., Ward, H., Wolfe, P.

    Title: Policy topics and their networks: NLP and network analysis of UK House of

    Commons debates

    Summary: Over the last twenty years, there has been extensive debate about what the core topics in public policy are, and whether they change in type and number.


    Furthermore, it is still not clear how policy topics mutate and diffuse over time. Human-

    based text analysis of governmental documents has shed only some light on these

    research questions: 21 general topics have been proposed, while other accounts tend to restrict them to about five main clusters. Yet this approach has not been able to devise a reliable and systematic method to capture topic dynamics, diffusion, and their relation to

    the human actors that talk about them.

    This research is an exploratory effort to overcome such limitations, through the

    application of natural language processing and network analysis on a novel dataset

    containing UK House of Commons debates ranging from 1936 to 2014. The dataset

    contains all speeches made in the House of Commons as part of the official record, with

    corresponding attributes of speaker, time and debate IDs. Overall the dataset contains

    4.5 million speeches over the sample time period.

    The research proceeds in three main steps. First, Latent Dirichlet Allocation (LDA) is

    applied to the documents, which have been divided into yearly sessions and according

    to speaker. LDA is an algorithm that uncovers the hidden thematic structure in document collections and is thus able to extract the underlying policy topics of the texts. Because LDA does not automatically determine the number of topics, this number must be specified by the user. After some trials, the authors opted for a 20-topic

    classification. Interestingly, the results of the LDA analysis show the presence of time-invariant topics related to core policy areas, such as health, education, industry and defence. Idiosyncratic topics also appear, strictly related to specific events such as the Iraq war in 2003. Second, for each session an undirected graph is constructed by

    using the information yielded by the topic analysis. Nodes in the graph are House of

    Commons speakers, while edges indicate that both speakers' associated texts contain a significant portion of the same topic (where a significant portion is defined by a threshold value of 10%). The resulting graphs are highly clustered, with speakers who talk about the same topic forming dense community structures. It is found that each community contains brokers, that is, speakers who predominantly talk about, and bridge, two or more topics. Socio-economic and social

    network data is used to investigate whether brokers are systematically different from

    the other nodes. Lastly, this paper presents a first attempt to model topic diffusion over

    time in a legislative setting. By drawing on the recent literature on information diffusion

    in social networks and social media, the authors track the adoption of topic-specific

    tokens by the nodes across a network of speakers.
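
    As an illustration of the two steps just described (per-speaker topic proportions from LDA, then an undirected graph linking speakers who share a topic above the 10% threshold), here is a minimal sketch. The speakers, documents, topic count and library choices are placeholders for the Hansard corpus and are not taken from the paper.

        import itertools
        import networkx as nx
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation

        # Hypothetical per-speaker documents; the real corpus aggregates
        # 4.5 million Hansard speeches by speaker and yearly session.
        docs = {
            "Speaker A": "hospital nurses doctors waiting lists health service funding",
            "Speaker B": "schools teachers pupils curriculum education standards funding",
            "Speaker C": "health service hospital beds education schools investment",
            "Speaker D": "defence troops procurement army navy equipment",
        }

        X = CountVectorizer().fit_transform(docs.values())
        # The authors settled on 20 topics; a toy corpus needs far fewer.
        lda = LatentDirichletAllocation(n_components=3, random_state=0)
        theta = lda.fit_transform(X)   # rows: speakers, columns: topic proportions

        # Connect two speakers if some topic exceeds 10% in both of their documents.
        G = nx.Graph()
        G.add_nodes_from(docs)
        names = list(docs)
        for (i, a), (j, b) in itertools.combinations(enumerate(names), 2):
            if any(theta[i, k] > 0.10 and theta[j, k] > 0.10 for k in range(theta.shape[1])):
                G.add_edge(a, b)

        print(sorted(G.edges()))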

    Melanie Smallman (12:10 12:50)

    UCL, Department of Science and Technology Studies

    [email protected]

    Title: The policy relevance of public and expert discourses on new science and

    technology: Using IRAMUTEQ to understand the substance of 10 years of public

    deliberations and its value to policymaking.


    Summary: Over the past 10 years, the UK government has funded a series of public

    deliberations on aspects of new science and technology, with the aim of gaining insight

    into public perceptions of scientific and technological issues and broadening the views

    feeding into policymaking. This move has drawn heavily on discussions within political

    science about the need for more participatory democracy. Despite being important to

    policy considerations, the question of what was discussed in deliberative activities has been largely overlooked in the academic literature in favour of more measurable questions about how the deliberation proceeded. As a result, the content of such discussions has taken on a black-box nature, and details and evaluations of the usefulness of the substance of such discussions for policymakers are largely missing from the literature. These discussions are, however, well documented and extensive, and are therefore ideal subjects for text-mining approaches.

    This paper sets out one such approach. Using the open-source statistical text analysis

    software IRAMUTEQ, we examine the public discourses around new technologies that

    are recorded and presented to policymakers in the reports of these public deliberations.

    We compare these discourses to those identified in the analogous expert reports, and

    those within Government statements on science and technology, and consider whether

    the public discourses are reflected in the expert and public policy discourses and

    whether or not they have any relevance or value to policymaking. Methodologically, we also consider the usefulness of fictional illustrative statements, which we construct from the wordlists and characteristic text segments produced by IRAMUTEQ, both in improving the quality of our interpretation of the statistical results and in communicating the findings in a transparent and credible manner.

    Lunch (12:50 - 2:00 pm)

    Andrea Gobbo (2:00 - 2:40 pm)

    [email protected]

    LSE, Institute of Social Psychology

    Title: Keywords and topics in 20 years of UK budget speeches.

    Summary: This paper proposes an analysis of British budget speeches that in the UK

    are traditionally delivered at the end of each fiscal year. The interest of this particular text lies in its ability to summarize the state of the UK economy in an articulated way, comprising long-term causes and effects. A vivid picture of the preceding year and recommendations for the following one are set out in exemplary terms, drawing on the current economic language. The aim of the analysis is therefore a linguistic assessment of the argument, in terms of word associations and keywords. The analysis

    proposed combines topic extraction, comparison of authorship and topic migration

    across the years to outline the shifting concerns of politicians. The corpus is built from the last 20 years of budget statements given by the Chancellors of the Exchequer Lamont, Clarke, Brown, Darling and Osborne. The argument is built around two methods of text analysis that correspond to two unsupervised approaches to quantitative text analysis: correspondence analysis and network theory applied to n-grams.

    The idea behind using two different methods on the same corpus is to contrast the results of the former, more established approach with those delivered by the latter, which should be regarded as a new application of known methods of network analysis. Network theory is in fact used in text analysis mainly on aggregated variables such as citations or authors, but only rarely on single words and even less on n-grams.

    Moreover, the network generated with one program is subsequently elaborated with

    advanced clustering algorithms and visualized with a different program.

    The co-occurrence analysis task is conducted with Iramuteq, the open-source program developed by Pierre Ratinaud at the University of Toulouse: the program can be considered the most advanced tool for quantitative text analysis in the spirit of the French school of textométrie (Salem, Lebart, Reinert). The latter task of textual network analysis is performed through the combination of VOSviewer, a Dutch scientometric analysis package, and Pajek, the first program conceived for network analysis (Batagelj). The joining of the two platforms enables a novel method of cluster visualization that appears particularly promising for small corpora. In this paper it is maintained that the specific differences between small corpora (here, the single speech) and large corpora (here exemplified by the sum of speeches) cannot emerge when the same methods are used.

    While both methods are aimed at topic extraction and topic migration identification, the

    contention is that they also let different linguistic phenomena emerge. Results are in

    fact comparable with regard to the final interpretation stage of topic identification but

    rather contrasting with regard to the process of keyword classification. The term 'keyword' is a modern label for a known linguistic phenomenon that has technically been called adjacent collocation (Sinclair). It is by relating keywords in a network of associations inside each speech that this analysis shows how an argument can be persuasive at the policy level while also implicitly mirroring the current social sentiment about the economy.
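
    As a hedged sketch of what a textual network over n-grams might look like (not the VOSviewer/Pajek pipeline the author actually uses), the example below builds a sentence-level co-occurrence network of bigrams and applies a generic modularity-based clustering; the sentences are invented stand-ins for passages of a budget speech.

        import itertools
        import networkx as nx

        sentences = [
            "public spending must support economic growth",
            "economic growth depends on stable public finances",
            "income tax cuts reward hard work",
            "hard work and enterprise drive economic growth",
        ]

        def bigrams(text):
            tokens = text.split()
            return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

        # Link two bigrams whenever they co-occur in the same sentence.
        G = nx.Graph()
        for sent in sentences:
            for a, b in itertools.combinations(set(bigrams(sent)), 2):
                weight = G.get_edge_data(a, b, {"weight": 0})["weight"]
                G.add_edge(a, b, weight=weight + 1)

        # A generic clustering pass stands in for the VOSviewer/Pajek step.
        clusters = nx.algorithms.community.greedy_modularity_communities(G, weight="weight")
        for i, cluster in enumerate(clusters):
            print(f"cluster {i}: {sorted(cluster)}")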

    Arnaud Vaganay (2:40 - 3:20 pm)

    [email protected]

    London School of Economics, Department of Methodology

    Title: Do policy researchers spin their results? A content analysis of six pilot evaluation

    reports.

    Summary: The reporting of interim evaluation outcomes creates an agency problem

    between policy evaluators (agents) and policymakers (principals) when a given reform

    is failing to fulfil its promises. Whereas the former are required to report outcomes fully and without preconception, the latter have a vested interest in framing these outcomes in a positive light, especially when they have previously expressed a commitment to the reform. The current evidence base is limited to a survey of policy evaluators and

    observational studies investigating the influence of industry sponsorship on the

    reporting of clinical trials.

    The objective of this study was twofold. Firstly, it aimed to assess the prevalence of

    outcome reporting bias (or spin) in pilot evaluation reports, using seven indicators

    developed by clinicians. Secondly, it sought to examine the possible effect of the government's commitment to a given reform on the level of spin found in the corresponding evaluation report.

    To answer these questions, I analysed the content of six evaluation reports which all

    found a non-significant effect of the interventions on their stated primary outcomes.

    These reports were systematically selected from a dataset of over 230 pilot and

    experimental evaluations spanning three policy areas and 13 years of government-

    commissioned research in the UK. The final selection was made with a view to maximizing the contrast between policy interventions showing strong and weak commitment.

    Out of the seven indicators of spin used in this study, one could not be verified given the

    available information, four led to a forthright rejection of the notion of spin and two

    found evidence of spin, suggesting a limited prevalence. No substantial association was

    found between the level of commitment of the government to the reform and the level

    of spin found in the corresponding report. This paper provides an interpretation of

    these results in the light of previous empirical findings, as well as recommendations for

    the large-N study it aimed to inform.

    Coffee Break (3:20-3:30 pm)

    Bankole Falade (3:30 - 4:10 pm)

    [email protected]

    LSE, Institute of Social Psychology

    Title: MACAS - Mapping the cultural authority of science

    Summary: The UK corpus of the MACAS project follows up on the 'Science and technology in the British press, 1946 to 1990' project, with a view to mapping the character of science news from 1990 to 2012 and providing a structure, using new computerised coding methods, that accommodates comparison with future media surveys. A corpus of 8,031 articles was obtained from NEXIS. Every second year from 1990 to 2012 was selected for the sampling frame. Every 25th edition, starting from an arbitrarily chosen date and using the Science and Technology filter, was picked, producing, in all, 14 editions of two constructed weeks. The coding employs a keyword


    search and the QDA Miner software. In all, 342 science keywords have been identified for the analysis and were used to produce cross-tabulations of countries, continents, actors, themes, scientific disciplines and keywords per year from 1990 to 2012. The next stage of the analysis is to operationalise higher-order text intuitions, also on the basis of the keyword mechanism, such as frames, dynamics of controversy and an index for sentiment analysis/evaluation. How to do that remains an open question.
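
    As a rough sketch of the keyword-by-year cross-tabulations described above (not the QDA Miner workflow itself), the example below tabulates a few invented coded articles with pandas; the keywords, years and countries are placeholders.

        import pandas as pd

        # Hypothetical coded articles; the real data are 8,031 NEXIS articles
        # coded against 342 science keywords.
        articles = pd.DataFrame([
            {"year": 1990, "keyword": "genetics", "country": "UK"},
            {"year": 1990, "keyword": "space",    "country": "US"},
            {"year": 1992, "keyword": "genetics", "country": "UK"},
            {"year": 1992, "keyword": "climate",  "country": "UK"},
            {"year": 1994, "keyword": "climate",  "country": "US"},
        ])

        # Keyword mentions per year and per country, analogous to the
        # cross-tabulations produced in the project.
        print(pd.crosstab(articles["keyword"], articles["year"]))
        print(pd.crosstab(articles["keyword"], articles["country"]))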

    Aude Bicquelet (4:10 - 4:50 pm)

    [email protected]

    Department of Methodology

    Title: Textual Analysis* - SAGE Benchmark Texts (book launch)

    * Martin W. Bauer, Ahmet Suerdem, Aude Bicquelet

    Summary: The authors will discuss the publication of their new book Textual Analysis, a four-volume set in the SAGE Benchmarks in Social Research Methods series (London: Sage).

    This four-volume book mines the extensive research of the past few decades into textual

    analysis. The editors have collated seminal papers which consider the key difference

    between content analysis and textual analysis, the conceptual starting point, logic and attitude of the research process, as well as the tension between reading a

    text and using a text, among other key issues. The carefully selected papers in this

    collection are put into context and analysed in a newly-written introductory chapter

    which charts the developments and looks at the future of the field.

    Concluding Remarks (4:50 - 5:00 pm)