Text Mining Conference Programme 2014
4th LSE Conference on Text Mining Methods: Text-Mining and Public Policy Analysis
London School of Economics and Political Science
Friday 27th June 2014
Room: AGWR (Graham Wallas Room, OLD5.25)
Contact: Dr. Aude Bicquelet: [email protected]
The 4th LSE Text-Mining Conference will be an opportunity for methodologists, political scientists and data miners to come together to discuss applications of, and experiences with, Text-Mining methods for public policy analysis. The conference will examine the application of digital technologies to support social and policy research within three methodological strands: (1) quantitative text analysis, (2) exploratory content analysis, and (3) qualitative content analysis. Presentations will focus on innovation in methodology and on applications of Text-Mining techniques to analyse and inform public policy decisions.
Conference Programme
Friday 27th June
Welcome and Introductions
Aude Bicquelet (9:45-10:00)
London School of Economics, Department of Methodology
Speakers:
Paul Nulty & Monica Poletti (10:00-10:40)
[email protected]; [email protected]
LSE (Department of Methodology); LSE & University of Milan
Title: The Immigration Issue in the European Electoral Campaign in the UK: Text-Mining Public Debate from Newspapers and Social Media
Summary: In recent years, the issue of immigration has become increasingly salient in
the UK political and media debate. Moreover, with the development and persistence of
the economic and financial crisis within the EU, immigration has been linked to growing
opposition and criticism towards the European Union. In a country in which
Euroscepticism has historically been high compared to countries in continental Europe,
EU immigration-related statements connected to EU free-border agreements became
more widespread. For this reason, we expect immigration to be a prominent issue in the
electoral campaign of the upcoming 2014 European Parliament elections in the media.
By covering (potential) EU immigrants and EU immigration issues in a certain way, the media tend to promote or restrain certain ideas of immigration, which might eventually affect the public's views. In fact, we know from previous studies that immigration,
particularly in times of economic crisis, is a challenge for society that can be framed not
only in positive or negative terms, but also in economic or cultural terms.
By looking at the news coverage of several newspapers, selected according to their
political orientations and including both broadsheets and tabloids, this study first
considers the salience of coverage of EU immigrants and EU immigration issues in UK
newspapers in the three months preceding the EU elections of May 2014. It further
explores whether news coverage of different newspapers is framed in economic (e.g. jobs) or cultural (e.g. identity) terms. In addition, we mine information from social media sources (Twitter and Wikipedia) to discover how the immigration debate is framed by politically engaged members of the public on these platforms.
Although inferring public opinion from social media can be problematic, it is a
rewarding approach given the volume of public text available, and the ability to identify
political affiliations through network connections. Understanding the representation of immigration and its connections with public opinion is crucial not only to contribute to the scholarly debate on anti-immigration and Eurosceptic attitudes, but also to better inform public policy decisions.
In addition to these substantive contributions, this study employs advanced text-mining
methods for data collection and data analysis. Text from politically engaged Twitter
users is collected using a novel iterative keyword-learning approach that allows high
precision and recall when the aim is to discover only tweets relevant to a particular
issue. This text, and the text from newspaper articles, is analysed using topic modelling
to investigate the salience of sub-issues in the immigration debate. In addition, weighted
log-odds ratios are used to measure linguistic differences across newspapers and social
media users from different sides of the ideological spectrum in the UK.
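The weighted log-odds statistic mentioned above can be sketched in a few lines. This is an illustrative implementation of a log-odds ratio smoothed with an informative Dirichlet prior (in the spirit of Monroe et al.'s "Fightin' Words" measure), not the authors' actual code; the function name, prior scale and toy counts are assumptions.

```python
from collections import Counter
from math import log, sqrt

def weighted_log_odds(counts_a, counts_b, prior_scale=10.0):
    """Z-scored log-odds ratio of word use in corpus A vs. corpus B,
    smoothed with a Dirichlet prior proportional to each word's
    overall frequency across the two corpora."""
    vocab = set(counts_a) | set(counts_b)
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    total = n_a + n_b
    scores = {}
    for w in vocab:
        prior = prior_scale * (counts_a[w] + counts_b[w]) / total
        a = counts_a[w] + prior            # smoothed count in A
        b = counts_b[w] + prior            # smoothed count in B
        delta = (log(a / (n_a + prior_scale - a))
                 - log(b / (n_b + prior_scale - b)))
        scores[w] = delta / sqrt(1.0 / a + 1.0 / b)  # z-score
    return scores

# Toy example: "jobs" is over-represented in the first corpus.
left = Counter({"jobs": 10, "the": 50})
right = Counter({"identity": 10, "the": 50})
scores = weighted_log_odds(left, right)
```

Positive scores mark words characteristic of the first corpus, negative scores words characteristic of the second; words used equally score near zero.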
Alexander Herzog and Kenneth Benoit, LSE (10:40-11:20)
[email protected]; [email protected]
Title: The Most Unkindest Cuts: Government Cohesion and Economic Crisis
Summary: Economic crisis and the resulting need for austerity budgets have divided many governing parties in Europe, despite the strict party discipline exercised over the legislative votes to approve these harsh budgets. Our analysis attempts to measure
divisions in governing coalitions by applying automated text analysis methods to scale
the positions that MPs express in budget debates. Our test case is Ireland, a country that
has experienced both periods of rapid economic growth as well as one deep financial
and economic crisis. Our analysis includes all annual budget debates during the time
period from 1983 to 2013. We demonstrate that government cohesion as expressed
through legislative speeches has significantly decreased as the economic crisis
deepened, the result of government backbenchers speaking against the painful austerity budgets introduced by their own governments. While ministers are bound by the doctrine of collective cabinet responsibility and hence always vote for the finance minister's budget proposal, we find that party backbenchers' position-taking
is systematically related to the economic vulnerability of their constituencies and to the
safety of their electoral margins.
Coffee Break (11:20-11:30)
Peter John et al.* (11:30-12:10)
UCL, School of Public Policy
* Gurciullo, S., Herzog, A., John, P., Mikhaylov, S., Ward, H., Wolfe, P.
Title: Policy topics and their networks: NLP and network analysis of UK House of
Commons debates
Summary: Over the last twenty years, there has been extensive debate about what the core topics in public policy are, and whether their type and number change. Furthermore, it is still not clear how policy topics mutate and diffuse over time. Human-based text analysis of governmental documents has shed only partial light on these research questions: 21 general topics have been proposed, while other accounts restrict them to about five main clusters. Yet this approach has not produced a reliable and systematic method to capture topic dynamics, diffusion, and their relation to the human actors that talk about them.
This research is an exploratory effort to overcome such limitations, through the
application of natural language processing and network analysis on a novel dataset
containing UK House of Commons debates ranging from 1936 to 2014. The dataset
contains all speeches made in the House of Commons as part of the official record, with
corresponding attributes of speaker, time and debate IDs. Overall the dataset contains
4.5 million speeches over the sample time period.
The research proceeds in three main steps. First, Latent Dirichlet Allocation (LDA) is
applied to the documents, which have been divided into yearly sessions and according
to speaker. LDA is an algorithm that uncovers the hidden thematic structure in document collections, and is thus able to extract the underlying policy topics of the texts. Because LDA does not determine the number of topics automatically, this number must be given by the user. After some trials, the authors opted for a 20-topic
classification. Interestingly, results of the LDA analysis show the presence of time-invariant topics related to core policy areas, such as health, education, industry and defence. Idiosyncratic topics also appear, closely tied to specific events such as the Iraq war in 2003. Second, for each session an undirected graph is constructed using the information yielded by the topic analysis. Nodes in the graph are House of Commons speakers, while edges represent whether both speakers' associated texts contain a significant portion of the same topic (where a significant portion is defined by a threshold value of 10%). The resulting graphs are highly clustered: speakers generally talking about the same topic form dense community structures. Each community is found to contain brokers, speakers who bridge two or more topics. Socio-economic and social network data are used to investigate whether brokers are systematically different from
the other nodes. Lastly, this paper presents a first attempt to model topic diffusion over
time in a legislative setting. By drawing on the recent literature on information diffusion in social networks and social media, the authors track the adoption of topic-specific
tokens by the nodes across a network of speakers.
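The edge-construction step described above (linking two speakers whenever some topic exceeds the 10% threshold in both of their topic distributions) can be sketched as follows. This is a minimal illustration rather than the authors' pipeline; the function name and the toy topic distributions are assumptions.

```python
import itertools

def co_topic_edges(topic_shares, threshold=0.10):
    """Given per-speaker topic distributions, return the undirected
    edges of a graph linking two speakers when some topic exceeds the
    threshold in BOTH distributions, labelled with the shared topics."""
    edges = {}
    for a, b in itertools.combinations(sorted(topic_shares), 2):
        shared = [t for t, p in topic_shares[a].items()
                  if p >= threshold
                  and topic_shares[b].get(t, 0.0) >= threshold]
        if shared:
            edges[(a, b)] = shared
    return edges

# Toy LDA output: topic proportions per speaker (hypothetical names).
shares = {
    "MP_A": {"health": 0.50, "defence": 0.05},
    "MP_B": {"health": 0.20, "education": 0.30},
    "MP_C": {"defence": 0.90},
}
edges = co_topic_edges(shares)
```

Only MP_A and MP_B share a topic above the threshold here; community detection and broker identification would then run on the resulting graph.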
Melanie Smallman (12:10-12:50)
UCL, Department of Science and Technology Studies
Title: The policy relevance of public and expert discourses on new science and
technology: Using IRAMUTEQ to understand the substance of 10 years of public
deliberations and its value to policymaking.
Summary: Over the past 10 years, the UK government has funded a series of public
deliberations on aspects of new science and technology, with the aim of gaining insight
into public perceptions of scientific and technological issues and broadening the views
feeding into policymaking. This move has drawn heavily on discussions within political
science about the need for more participatory democracy. Despite its importance to policy considerations, the question of what was discussed in deliberative activities has been largely overlooked in the academic literature in favour of more measurable questions about how the deliberation proceeded. As a result, the content of such discussions has taken on a black-box character, and details and evaluations of the usefulness of the substance of such discussions for policymakers are largely missing from the literature. These discussions are, however, well documented and extensive, and therefore ideal subjects for text-mining approaches.
This paper sets out one such approach. Using the open-source statistical text analysis
software IRAMUTEQ, we examine the public discourses around new technologies that
are recorded and presented to policymakers in the reports of these public deliberations.
We compare these discourses to those identified in the analogous expert reports, and
those within Government statements on science and technology, and consider whether
the public discourses are reflected in the expert and public policy discourses and
whether or not they have any relevance or value to policymaking. Methodologically, we
also consider the usefulness of fictional illustrative statements, which we construct from
the wordlists and characteristic text segments produced by IRAMUTEQ, in improving
both the quality of our interpretation of the statistical results and in communicating the
findings in a transparent and credible manner.
Lunch (12:50-2:00 pm)
Andrea Gobbo (2:00- 2:40 pm)
LSE, Institute of Social Psychology
Title: Keywords and topics in 20 years of UK budget speeches.
Summary: This paper proposes an analysis of the budget speeches that in the UK are traditionally delivered at the end of each fiscal year. The interest of this particular text lies in its ability to summarise the state of the UK economy in an articulated way, comprising long-term causes and effects. A vivid picture of the preceding year and recommendations for the following one are set out in exemplary terms, drawing on the current economic language. The aim of the analysis is therefore a linguistic assessment of the argument, in terms of word associations and keywords. The proposed analysis combines topic extraction, comparison of authorship, and topic migration across the years to outline the shifting concerns of politicians. The corpus is built on the
last 20 years of budget statements given by the Chancellors of the Exchequer Lamont, Clarke, Brown, Darling and Osborne. The argument is built around two methods of text
analysis that refer to two unsupervised approaches of quantitative text analysis:
correspondence analysis and network theory applied to n-grams.
The idea behind using two different methods on the same corpus is to contrast the results of the former, more consolidated approach with those delivered by the latter, which should be regarded as a new application of known methods of network analysis. Network theory is in fact used in text analysis mainly on aggregated variables like citations or authors, but rarely on single words and even less on n-grams. Moreover, the network generated with one program is subsequently elaborated with advanced clustering algorithms and visualised with a different program.
The co-occurrence analysis task is conducted with IRAMUTEQ, the open-source program developed by Pierre Ratinaud at the University of Toulouse: the program can be considered the most advanced tool for quantitative text analysis in the spirit of the French school of textométrie (Salem, Lebart, Reinert). The latter task of textual network analysis is performed by combining VOSviewer, a Dutch scientometric analysis package, and Pajek, the first-ever program conceived for network analysis (Batagelj). Joining the two platforms enables a novel method of cluster visualisation that appears particularly promising for small corpora. This paper maintains that the specific differences between small corpora (here, the single speech) and large corpora (here, the sum of speeches) cannot emerge when the same methods are used for both.
While both methods are aimed at topic extraction and the identification of topic migration, the contention is that they also let different linguistic phenomena emerge. Results are in fact comparable with regard to the final interpretation stage of topic identification, but rather contrasting with regard to the process of keyword classification. The term keyword is a modern label for a known linguistic phenomenon technically called adjacent collocation (Sinclair). It is by relating keywords in a network of associations inside each speech that this analysis shows how an argument can be persuasive at a policy level while also implicitly mirroring the current social sentiment about the economy.
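As a flavour of the n-gram network approach described above, the following sketch counts adjacent word pairs (Sinclair's adjacent collocations) and keeps those frequent enough to form weighted network edges. It is deliberately minimal: a real pipeline such as the one described here would add stop-word removal, lemmatisation and clustering, and the function name and toy speech are assumptions.

```python
from collections import Counter

def bigram_edges(text, min_count=2):
    """Count adjacent word pairs (bigrams) in a text and keep those
    occurring at least `min_count` times, i.e. the weighted edges of
    a simple word co-occurrence network."""
    tokens = [w.strip(".,;:!?()\"'").lower() for w in text.split()]
    tokens = [t for t in tokens if t]
    pairs = Counter(zip(tokens, tokens[1:]))
    return {pair: n for pair, n in pairs.items() if n >= min_count}

# Tiny toy "speech": only the repeated bigram survives the threshold.
speech = "Public spending must fall; public spending cuts will follow."
edges = bigram_edges(speech)
```

The surviving pairs and their counts would then be handed to a network tool for clustering and visualisation.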
Arnaud Vaganay (2:40-3:20 pm)
London School of Economics, Department of Methodology
Title: Do policy researchers spin their results? A content analysis of six pilot evaluation
reports.
Summary: The reporting of interim evaluation outcomes creates an agency problem
between policy evaluators (agents) and policymakers (principals) when a given reform
is failing to fulfil its promises. Whereas the former are required to report outcomes fully and without preconception, the latter have a vested interest in framing these outcomes in a positive light, especially when they have previously expressed a commitment to the reform. The current evidence base is limited to a survey of policy evaluators and
observational studies investigating the influence of industry sponsorship on the
reporting of clinical trials.
The objective of this study was twofold. Firstly, it aimed to assess the prevalence of
outcome reporting bias (or spin) in pilot evaluation reports, using seven indicators
developed by clinicians. Secondly, it sought to examine the possible effect of the government's commitment to a given reform on the level of spin found in the corresponding evaluation report.
To answer these questions, I analysed the content of six evaluation reports which all
found a non-significant effect of the interventions on their stated primary outcomes.
These reports were systematically selected from a dataset of over 230 pilot and experimental evaluations spanning three policy areas and 13 years of government-commissioned research in the UK. The final selection was made with a view to maximising the contrast between policy interventions showing strong and weak commitment.
Out of the seven indicators of spin used in this study, one could not be verified given the
available information, four led to a forthright rejection of the notion of spin, and two found evidence of spin, suggesting a limited prevalence. No substantial association was
found between the level of commitment of the government to the reform and the level
of spin found in the corresponding report. This paper provides an interpretation of these results in the light of previous empirical findings, as well as recommendations for the large-N study it aimed to inform.
Coffee Break (3:20-3:30 pm)
Bankole Falade (3:30 - 4:10 pm)
LSE, Institute of Social Psychology
Title: MACAS: Mapping the cultural authority of science
Summary: The UK corpus of the MACAS project follows up on the Science and
technology in the British press - 1946 to 1990 project, with a view to mapping the
character of science news from 1990 to 2012 and providing a structure, using new
computerised coding methods, that accommodates comparison with future media
surveys. A corpus of 8031 articles was obtained from NEXIS. Every second year from
1990 to 2012 was selected for the sampling frame. Every 25th edition, starting from an arbitrarily chosen date and using the Science and Technology filter, was picked, producing in all 14 editions of two constructed weeks. The coding employs a keyword search and the QDA Miner software. 342 science keywords have been identified for the
analysis and were used to produce cross tabulations for countries, continents, actors,
themes, scientific discipline and keywords per year from 1990 to 2012. The next stage
of the analysis is to operationalise higher-order textual intuitions, also on the basis of the keyword mechanism: frames, dynamics of controversy, and an index for sentiment analysis/evaluation. How to do this remains an open question.
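A keyword search feeding cross-tabulations of the kind described above can be illustrated in a few lines. The function, keyword list and article structure are all hypothetical stand-ins for the QDA Miner workflow, and the plain substring match is a deliberate simplification.

```python
from collections import Counter

def keyword_crosstab(articles, keywords):
    """Tally, for each (year, keyword) pair, how many articles mention
    the keyword: one slice of the keywords-per-year cross-tabulation.
    Uses a simple substring match, so 'gene' also matches 'genetic'."""
    table = Counter()
    for year, text in articles:
        lowered = text.lower()
        for kw in keywords:
            if kw.lower() in lowered:
                table[(year, kw)] += 1
    return table

# Hypothetical mini-corpus of (year, headline) pairs.
corpus = [
    (1990, "Gene therapy trial begins"),
    (1992, "New gene discovered"),
    (1992, "Space probe reaches Jupiter"),
]
table = keyword_crosstab(corpus, ["gene", "space"])
```

The same tallies, grouped by country, actor or theme instead of year, would yield the other cross-tabulations mentioned above.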
Aude Bicquelet (4:10-4:50 pm)
Department of Methodology
Title: Textual Analysis*, SAGE Benchmark Texts (book launch)
* Martin W. Bauer, Ahmet Suerdem, Aude Bicquelet
Summary: The authors will discuss the publication of their new book Textual Analysis (four volumes, SAGE Benchmarks in Social Research Methods; London: Sage).
This four-volume book mines the extensive research of the past few decades into textual
analysis. The editors have collated seminal papers which consider the key difference
between content analysis and textual analysis, the conceptual starting point and the
logic and the attitude of the research process, as well as the tension between reading a
text and using a text, among other key issues. The carefully selected papers in this
collection are put into context and analysed in a newly-written introductory chapter
which charts the developments and looks at the future of the field.
Concluding Remarks (4:50-5:00 pm)