Text Mining Conference Programme 2014 Final


  • 4th LSE Conference on Text Mining Methods: Text-Mining and Public Policy Analysis

    London School of Economics and Political Science

    Friday 27th June 2014

    Room: AGWR

    (Graham Wallas Room, OLD5.25)

    Contact: Dr. Aude Bicquelet: [email protected]

    The 4th LSE Text-Mining Conference will be an opportunity for methodologists, political scientists and data miners to come together to discuss applications and experiences in using text-mining methods for public policy analysis. The conference will examine the application of digital technologies to support social and policy research within three methodological strands: (1) quantitative text analysis, (2) exploratory content analysis and (3) qualitative content analysis. Presentations will focus on innovation in methodology and on applications of text-mining techniques to analyse and inform public policy decisions.


    Conference Programme

    Friday 27th June

    Welcome and Introductions

    Aude Bicquelet (9:45 - 10:00)

    London School of Economics, Department of Methodology

    Speakers:

    Paul Nulty & Monica Poletti (10:00-10:40)

    [email protected]; [email protected]

    LSE (Department of Methodology); LSE & University of Milan

    Title: The Immigration Issue in the European Electoral Campaign in the UK: Text-Mining Public Debate from Newspapers and Social Media

    Summary: In recent years, the issue of immigration has become increasingly salient in

    the UK political and media debate. Moreover, with the development and persistence of

    the economic and financial crisis within the EU, immigration has been linked to growing

    opposition and criticism towards the European Union. In a country in which

    Euroscepticism has historically been high compared to countries in continental Europe,

    EU immigration-related statements connected to EU free-border agreements became

    more widespread. For this reason, we expect immigration to be a prominent issue in the media coverage of the electoral campaign for the 2014 European Parliament elections.

    By covering (potential) EU immigrants and EU immigration issues in a certain way, the media tend to promote or restrain certain ideas of immigration, which might eventually affect the public's views. In fact, we know from previous studies that immigration,

    particularly in times of economic crisis, is a challenge for society that can be framed not

    only in positive or negative terms, but also in economic or cultural terms.

    By looking at the news coverage of several newspapers, selected according to their

    political orientations and including both broadsheets and tabloids, this study first

    considers the salience of coverage of EU immigrants and EU immigration issues in UK

    newspapers in the three months preceding the EU elections of May 2014. It further

    explores whether the news coverage of different newspapers is framed in economic (e.g. jobs) or cultural (e.g. identity) terms. In addition, we mine information from social media sources (Twitter and Wikipedia) to discover how the immigration

    debate is framed by politically engaged members of the public on these platforms.

    Although inferring public opinion from social media can be problematic, it is a

    rewarding approach given the volume of public text available, and the ability to identify

    political affiliations through network connections. Understanding the representation of immigration and its connections with public opinion is crucial not only to contribute to the scholarly debate on anti-immigration and Eurosceptic attitudes, but also to better inform public policy decisions.

    In addition to these substantive contributions, this study employs advanced text-mining

    methods for data collection and data analysis. Text from politically engaged Twitter

    users is collected using a novel iterative keyword-learning approach that allows high

    precision and recall when the aim is to discover only tweets relevant to a particular

    issue. This text, and the text from newspaper articles, is analysed using topic modelling

    to investigate the salience of sub-issues in the immigration debate. In addition, weighted

    log-odds ratios are used to measure linguistic differences across newspapers and social

    media users from different sides of the ideological spectrum in the UK.
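
    The abstract names weighted log-odds ratios as the measure of linguistic difference but gives no implementation detail. Below is a minimal sketch of one common formulation (a log-odds ratio with an informative Dirichlet prior and variance standardisation, in the spirit of Monroe, Colaresi and Quinn); the group names, word counts and prior_scale value are invented placeholders, not the authors' data or settings.

        from collections import Counter
        import math

        def weighted_log_odds(counts_a, counts_b, prior_scale=0.01):
            """Weighted log-odds of word use in corpus A relative to corpus B,
            with an informative Dirichlet prior taken from the pooled corpus
            and standardised by the estimated variance."""
            vocab = set(counts_a) | set(counts_b)
            prior = {w: prior_scale * (counts_a.get(w, 0) + counts_b.get(w, 0)) for w in vocab}
            n_a = sum(counts_a.values()) + sum(prior.values())
            n_b = sum(counts_b.values()) + sum(prior.values())
            scores = {}
            for w in vocab:
                y_a = counts_a.get(w, 0) + prior[w]
                y_b = counts_b.get(w, 0) + prior[w]
                delta = math.log(y_a / (n_a - y_a)) - math.log(y_b / (n_b - y_b))
                scores[w] = delta / math.sqrt(1.0 / y_a + 1.0 / y_b)
            return scores

        # Hypothetical word counts standing in for two groups of outlets or users.
        group_a = Counter("immigration workers jobs jobs welfare rights".split())
        group_b = Counter("immigration border border control benefits sovereignty".split())
        for word, z in sorted(weighted_log_odds(group_a, group_b).items(), key=lambda kv: -kv[1]):
            print(f"{word:12s} {z:+.2f}")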

    Alexander Herzog and Kenneth Benoit (10:40-11:20)

    [email protected]; [email protected]

    LSE

    Title: The Most Unkindest Cuts: Government Cohesion and Economic Crisis

    Summary: Economic crisis and the resulting need for austerity budgets have divided many governing parties in Europe, despite the strict party discipline exercised over the

    legislative votes to approve these harsh budgets. Our analysis attempts to measure

    divisions in governing coalitions by applying automated text analysis methods to scale

    the positions that MPs express in budget debates. Our test case is Ireland, a country that

    has experienced both periods of rapid economic growth and a deep financial and economic crisis. Our analysis includes all annual budget debates during the time

    period from 1983 to 2013. We demonstrate that government cohesion as expressed

    through legislative speeches has significantly decreased as the economic crisis

    deepened, the result of government backbenchers speaking against the painful austerity budgets introduced by their own governments. While ministers are bound by the doctrine of collective cabinet responsibility and hence always vote for the finance minister's budget proposal, we find that party backbenchers' position-taking is systematically related to the economic vulnerability of their constituencies and to the safety of their electoral margins.
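
    The abstract does not name the scaling model (Poisson scaling models of the 'Wordfish' family are a common choice for budget speeches). Purely as an illustration of one-dimensional scaling of speech texts, and not as the authors' method, the sketch below scores a few invented speech snippets on the first dimension of a correspondence analysis of the document-term matrix.

        import numpy as np
        from sklearn.feature_extraction.text import CountVectorizer

        # Hypothetical snippets standing in for full MP budget speeches.
        speeches = {
            "Minister A":    "the budget secures stability growth and fiscal responsibility",
            "Backbencher B": "these cuts hurt families in my constituency and deepen unemployment",
            "Backbencher C": "painful cuts are regrettable but necessary for fiscal stability",
        }

        X = CountVectorizer().fit_transform(speeches.values()).toarray().astype(float)
        P = X / X.sum()                                     # relative frequencies
        r, c = P.sum(axis=1), P.sum(axis=0)                 # row and column masses
        S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardised residuals
        U, s, Vt = np.linalg.svd(S, full_matrices=False)
        positions = (U[:, 0] * s[0]) / np.sqrt(r)           # first-dimension row scores

        for name, score in zip(speeches, positions):
            print(f"{name:14s} {score:+.3f}")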

    Coffee Break (11:20-11:30)

    Peter John et al.* (11:30 - 12:10)

    [email protected]

    UCL, School of Public Policy

    * Gurciullo, S., Herzog, A., John, P., Mikhaylov, S., Ward, H., Wolfe, P.

    Title: Policy topics and their networks: NLP and network analysis of UK House of

    Commons debates

    Summary: Over the last twenty years, there has been extensive debate about what the core topics in public policy are, and whether they change in type and number.


    Furthermore, it is still not clear how policy topics mutate and diffuse over time. Human-

    based text analysis of governmental documents has shed only some light on these

    research questions: 21 general topics have been proposed, while other accounts tend to restrict them to about five main clusters. Yet this approach has not been able to devise a reliable and systematic method to capture topic dynamics, diffusion, and their relation to

    the human actors that talk about them.

    This research is an exploratory effort to overcome such limitations, through the

    application of natural language processing and network analysis on a novel dataset

    containing UK House of Commons debates ranging from 1936 to 2014. The dataset

    contains all speeches made in the House of Commons as part of the official record, with

    corresponding attributes of speaker, time and debate IDs. Overall the dataset contains

    4.5 million speeches over the sample time period.

    The research proceeds in three main steps. First, Latent Dirichlet Allocation (LDA) is

    applied to the documents, which have been divided into yearly sessions and according

    to speaker. LDA is an algorithm that uncovers the hidden thematic structure in document collections and is thus able to extract the underlying policy topics of the texts. Because LDA does not automatically determine the number of topics, this number must be specified by the user. After some trials, the authors opted for a 20-topic

    classification. Interestingly, the results of the LDA analysis show the presence of time-invariant topics related to core policy areas, such as health, education, industry and defence. Idiosyncratic topics also appear, strictly related to specific events such as the Iraq war in 2003. Second, for each session an undirected graph is constructed by

    using the information yielded by the topic analysis. Nodes in the graph are House of

    Commons speakers, while edges indicate that both speakers' associated texts contain a significant portion of the same topic (where a significant portion is defined by a threshold value of 10%). The resulting graphs are highly clustered, with speakers who talk about the same topic forming dense community structures. It is found that each community contains brokers, that is, speakers who predominantly talk about, and bridge, two or more topics. Socio-economic and social

    network data is used to investigate whether brokers are systematically different from

    the other nodes. Lastly, this paper presents a first attempt to model topic diffusion over

    time in a legislative setting. By drawing on the recent literature on information diffusion

    in social networks and social media, the authors track the adoption of topic-specific

    tokens by the nodes across a network of speakers.
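
    As an illustration of the two steps just described (per-speaker topic proportions from LDA, then an undirected graph linking speakers who share a topic above the 10% threshold), here is a minimal sketch. The speakers, documents, topic count and library choices are placeholders for the Hansard corpus and are not taken from the paper.

        import itertools
        import networkx as nx
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation

        # Hypothetical per-speaker documents; the real corpus aggregates
        # 4.5 million Hansard speeches by speaker and yearly session.
        docs = {
            "Speaker A": "hospital nurses doctors waiting lists health service funding",
            "Speaker B": "schools teachers pupils curriculum education standards funding",
            "Speaker C": "health service hospital beds education schools investment",
            "Speaker D": "defence troops procurement army navy equipment",
        }

        X = CountVectorizer().fit_transform(docs.values())
        # The authors settled on 20 topics; a toy corpus needs far fewer.
        lda = LatentDirichletAllocation(n_components=3, random_state=0)
        theta = lda.fit_transform(X)   # rows: speakers, columns: topic proportions

        # Connect two speakers if some topic exceeds 10% in both of their documents.
        G = nx.Graph()
        G.add_nodes_from(docs)
        names = list(docs)
        for (i, a), (j, b) in itertools.combinations(enumerate(names), 2):
            if any(theta[i, k] > 0.10 and theta[j, k] > 0.10 for k in range(theta.shape[1])):
                G.add_edge(a, b)

        print(sorted(G.edges()))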

    Melanie Smallman (12:10 12:50)

    UCL, Department of Science and Technology Studies

    [email protected]

    Title: The policy relevance of public and expert discourses on new science and

    technology: Using IRAMUTEQ to understand the substance of 10 years of public

    deliberations and its value to policymaking.


    Summary: Over the past 10 years, the UK government has funded a series of public

    deliberations on aspects of new science and technology, with the aim of gaining insight

    into public perceptions of scientific and technological issues and broadening the views

    feeding into policymaking. This move has drawn heavily on discussions within political

    science about the need for more participatory democracy. Despite being important to

    policy considerations, the question of what was discussed in deliberative activities has been largely overlooked in the academic literature in favour of more measurable questions about how the deliberation proceeded. As a result, the content of such discussions has taken on a black-box nature, and details and evaluations of the usefulness of the substance of such discussions for policymakers are largely missing from the literature. These discussions are, however, well documented and extensive, and are therefore ideal subjects for text-mining approaches.

    This paper sets out one such approach. Using the open-source statistical text analysis

    software IRAMUTEQ, we examine the public discourses around new technologies that

    are recorded and presented to policymakers in the reports of these public deliberations.

    We compare these discourses to those identified in the analogous expert reports, and

    those within Government statements on science and technology, and consider whether

    the public discourses are reflected in the expert and public policy discourses and

    whether or not they have any relevance or value to policymaking. Methodologically, we also consider the usefulness of fictional illustrative statements, which we construct from the wordlists and characteristic text segments produced by IRAMUTEQ, both in improving the quality of our interpretation of the statistical results and in communicating the findings in a transparent and credible manner.

    Lunch (12:50 - 2:00 pm)

    Andrea Gobbo (2:00 - 2:40 pm)

    [email protected]

    LSE, Institute of Social Psychology

    Title: Keywords and topics in 20 years of UK budget speeches.

    Summary: This paper proposes an analysis of British budget speeches that in the UK

    are traditionally delivered at the end of each fiscal year. The interest of this particular text lies in its ability to summarize the state of the UK economy in an articulated way, comprising long-term causes and effects. A vivid picture of the preceding year and recommendations for the following one are set out in exemplary terms, drawing on the current economic language. The aim of the analysis is therefore a linguistic assessment of the argument, in terms of word associations and keywords. The analysis

    proposed combines topic extraction, comparison of authorship and topic migration

    across the years to outline the shifting concerns of politicians. The corpus is built from the last 20 years of budget statements given by the Chancellors of the Exchequer Lamont, Clarke, Brown, Darling and Osborne. The argument is built around two methods of text analysis that correspond to two unsupervised approaches to quantitative text analysis: correspondence analysis and network theory applied to n-grams.

    The idea behind using two different methods on the same corpus is to contrast the results of the former, more established approach with those delivered by the latter, which should be regarded as a new application of known methods of network analysis. Network theory is in fact used in text analysis mainly on aggregated variables such as citations or authors, but only rarely on single words and even less on n-grams.

    Moreover, the network generated with one program is subsequently elaborated with

    advanced clustering algorithms and visualized with a different program.

    The co-occurrence analysis task is conducted with Iramuteq, the open-source program developed by Pierre Ratinaud at the University of Toulouse: the program can be considered the most advanced tool for quantitative text analysis in the spirit of the French school of textométrie (Salem, Lebart, Reinert). The latter task of textual network analysis is performed through the combination of VOSviewer, a Dutch scientometric analysis package, and Pajek, the first program conceived for network analysis (Batagelj). The joining of the two platforms enables a novel method of cluster visualization that appears particularly promising for small corpora. In this paper it is maintained that the specific differences between small corpora (here, the single speech) and large corpora (here exemplified by the sum of speeches) cannot emerge when the same methods are used.

    While both methods are aimed at topic extraction and topic migration identification, the

    contention is that they also let different linguistic phenomena emerge. Results are in

    fact comparable with regard to the final interpretation stage of topic identification but

    rather contrasting with regard to the process of keyword classification. The term 'keyword' is a modern label for a known linguistic phenomenon that has technically been called adjacent collocation (Sinclair). It is by relating keywords in a network of associations inside each speech that this analysis shows how an argument can be persuasive at the policy level while also implicitly mirroring the current social sentiment about the economy.
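
    As a hedged sketch of what a textual network over n-grams might look like (not the VOSviewer/Pajek pipeline the author actually uses), the example below builds a sentence-level co-occurrence network of bigrams and applies a generic modularity-based clustering; the sentences are invented stand-ins for passages of a budget speech.

        import itertools
        import networkx as nx

        sentences = [
            "public spending must support economic growth",
            "economic growth depends on stable public finances",
            "income tax cuts reward hard work",
            "hard work and enterprise drive economic growth",
        ]

        def bigrams(text):
            tokens = text.split()
            return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

        # Link two bigrams whenever they co-occur in the same sentence.
        G = nx.Graph()
        for sent in sentences:
            for a, b in itertools.combinations(set(bigrams(sent)), 2):
                weight = G.get_edge_data(a, b, {"weight": 0})["weight"]
                G.add_edge(a, b, weight=weight + 1)

        # A generic clustering pass stands in for the VOSviewer/Pajek step.
        clusters = nx.algorithms.community.greedy_modularity_communities(G, weight="weight")
        for i, cluster in enumerate(clusters):
            print(f"cluster {i}: {sorted(cluster)}")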

    Arnaud Vaganay (2:40 - 3:20 pm)

    [email protected]

    London School of Economics, Department of Methodology

    Title: Do policy researchers spin their results? A content analysis of six pilot evaluation

    reports.

    Summary: The reporting of interim evaluation outcomes creates an agency problem

    between policy evaluators (agents) and policymakers (principals) when a given reform

    is failing to fulfil its promises. Whereas the former are required to report outcomes fully and without preconception, the latter have a vested interest in framing these outcomes in a positive light, especially when they have previously expressed a commitment to the reform. The current evidence base is limited to a survey of policy evaluators and

    observational studies investigating the influence of industry sponsorship on the

    reporting of clinical trials.

    The objective of this study was twofold. Firstly, it aimed to assess the prevalence of

    outcome reporting bias (or spin) in pilot evaluation reports, using seven indicators

    developed by clinicians. Secondly, it sought to examine the possible effect of the government's commitment to a given reform on the level of spin found in the corresponding evaluation report.

    To answer these questions, I analysed the content of six evaluation reports which all

    found a non-significant effect of the interventions on their stated primary outcomes.

    These reports were systematically selected from a dataset of over 230 pilot and

    experimental evaluations spanning three policy areas and 13 years of government-

    commissioned research in the UK. The final selection was made with a view to maximizing the contrast between policy interventions showing strong and weak commitment.

    Out of the seven indicators of spin used in this study, one could not be verified given the

    available information, four led to a forthright rejection of the notion of spin and two

    found evidence of spin, suggesting a limited prevalence. No substantial association was

    found between the level of commitment of the government to the reform and the level

    of spin found in the corresponding report. This paper provides an interpretation of

    these results in the light of previous empirical findings, as well as recommendations for

    the large-N study it aimed to inform.

    Coffee Break (3:20-3:30 pm)

    Bankole Falade (3:30 - 4:10 pm)

    [email protected]

    LSE, Institute of Social Psychology

    Title: MACAS - Mapping the cultural authority of science

    Summary: The UK corpus of the MACAS project follows up on the 'Science and technology in the British press, 1946 to 1990' project, with a view to mapping the character of science news from 1990 to 2012 and providing a structure, using new computerised coding methods, that accommodates comparison with future media surveys. A corpus of 8,031 articles was obtained from NEXIS. Every second year from 1990 to 2012 was selected for the sampling frame. Every 25th edition, starting from an arbitrarily chosen date and using the Science and Technology filter, was picked, producing, in all, 14 editions of two constructed weeks. The coding employs a keyword


    search and the QDA Miner software. In all, 342 science keywords have been identified for the analysis and were used to produce cross-tabulations of countries, continents, actors, themes, scientific disciplines and keywords per year from 1990 to 2012. The next stage of the analysis is to operationalise higher-order text intuitions, also on the basis of the keyword mechanism, such as frames, dynamics of controversy and an index for sentiment analysis/evaluation. How to do that remains an open question.
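
    As a rough sketch of the keyword-by-year cross-tabulations described above (not the QDA Miner workflow itself), the example below tabulates a few invented coded articles with pandas; the keywords, years and countries are placeholders.

        import pandas as pd

        # Hypothetical coded articles; the real data are 8,031 NEXIS articles
        # coded against 342 science keywords.
        articles = pd.DataFrame([
            {"year": 1990, "keyword": "genetics", "country": "UK"},
            {"year": 1990, "keyword": "space",    "country": "US"},
            {"year": 1992, "keyword": "genetics", "country": "UK"},
            {"year": 1992, "keyword": "climate",  "country": "UK"},
            {"year": 1994, "keyword": "climate",  "country": "US"},
        ])

        # Keyword mentions per year and per country, analogous to the
        # cross-tabulations produced in the project.
        print(pd.crosstab(articles["keyword"], articles["year"]))
        print(pd.crosstab(articles["keyword"], articles["country"]))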

    Aude Bicquelet (4:10 - 4:50 pm)

    [email protected]

    Department of Methodology

    Title: Textual Analysis* - SAGE Benchmark Texts (book launch)

    * Martin W. Bauer, Ahmet Suerdem, Aude Bicquelet

    Summary: The authors will discuss the publication of their new book Textual Analysis, a four-volume set in the SAGE Benchmarks in Social Research Methods series (London: Sage).

    This four-volume book mines the extensive research of the past few decades into textual

    analysis. The editors have collated seminal papers which consider the key difference

    between content analysis and textual analysis, the conceptual starting point, logic and attitude of the research process, as well as the tension between reading a

    text and using a text, among other key issues. The carefully selected papers in this

    collection are put into context and analysed in a newly-written introductory chapter

    which charts the developments and looks at the future of the field.

    Concluding Remarks (4:50 - 5:00 pm)