A Twitter Sentiment Gold Standard for the Brexit Referendum

In this paper, we present a sentiment-annotated Twitter gold standard for the Brexit referendum. The data set consists of 2,000 Twitter messages ("tweets") annotated with information about the sentiment expressed, the strength of the sentiment, and context dependence. This is a valuable resource for social media-based opinion mining in the context of political events.


INTRODUCTION
Popular referenda provide a rich setting for understanding the social and discourse dynamics behind a focussed political discussion. Under these settings, opinion mining and sentiment analysis over social media are fundamental tools to provide systematic prospective and retrospective insights, supporting an analysis of the underlying political processes and dynamics.
However, referendum-type events require different techniques and resources for opinion mining and sentiment analysis, as these events have distinctive social dynamics and political discourse. The availability of language resources to ground the discourse analysis and the construction of supervised classification methods play a fundamental role in advancing our ability to build systems which can support the interpretation of social media discourse.
Aiming to support the evolution of the classification methods, this paper presents a dataset of sentiment-annotated social posts targeting the historical event of the United Kingdom European Union membership referendum ("Brexit"), which took place on June 23, 2016. Data collection and annotation were carried out in the context of the SSIX H2020 project [2], which targets the creation of customisable sentiment metrics for social media. The dataset, containing 2,000 annotated tweets, was collected from Twitter prior to the event. The dataset has been published under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) licence for general use.

RELATED WORK
A number of datasets have been created in the context of Brexit. The #ImagineEurope project [3,4] collected a dataset of tweets using a set of hashtags that are related to the Brexit referendum, such as the ones used by the leave and remain campaigns. Examples of hashtags include #migrant, #refugee, #strongerin, #leadnotleave. The collection of this dataset was initiated several months prior to the referendum (7th August 2015).
A Twitter-based dataset with a wider scope is the Twitter political corpus [6]. It was collected during 2009 and consists of two subcorpora. The first is a randomly selected set of 2000 tweets from Twitter's "spritzer" feed collected between June 1, 2009 and Dec 31, 2009, while the tweets for the second corpus were randomly selected from a subset of tweets which contained at least one political keyword each. The aim of this study was to develop learning algorithms which link political statements on Twitter to general opinions about government and politicians. A classifier was trained on these corpora for sentiment analysis and was used to predict presidential approval polls.
Referendum legislation: http://www.legislation.gov.uk/ukpga/2015/36/crossheading/the-referendum
Data set repository: https://bitbucket.org/ssix-project/brexit-gold-standard
Several researchers have studied the social media debate surrounding Brexit. In [5], the authors presented a demonstrator which visualises the Twittersphere debate on whether the UK should remain in or leave the European Union. It shows the different discussion topics identified by the different search strategies of the collected data, namely hashtag search terms, extraction from the full stream and following specific users.
A study by [7] analysed over 1.5 million tweets related to the referendum up until 13 May 2016. Tweets were identified by common referendum-related terms, such as #brexit, #StrongerIn and "uk eu vote", and then analysed for the use of hashtags, citations, and general sentiment for leaving or remaining in the EU. Results from the study suggested that the Brexit referendum would be very close, with a clear preference among Twitter users for leaving the EU.
The real-time monitoring of the Brexit campaign set up by [10] showed indications that 3 hours before the polls closed in the UK, the split between #VoteRemain and #VoteLeave was roughly 40:60 for the last four hours of voting. This study also dealt with parallel investigations and analyses of how the market was reacting and how currencies were changing, which are of real, actionable value to financial firms including hedge funds, government bodies, politicians, and policy makers.

METHOD

Sampling and Filtering
In order to collect the data set, we sampled 2000 tweets uniformly at random from a Twitter stream set up to track 75 keywords, including hashtags and account names. Keywords were chosen by manually identifying common terms associated with content relevant to Brexit, for example #eureferendum, #votein, #voteleave. Appendix A provides a list of Twitter tracking keywords (hashtags and Twitter handles). Data collection on this stream between May 4 and May 6, 2016 (inclusive) resulted in a population of 149,331 tweets. Before sampling, filters were applied to exclude spam and irrelevant content: for example, we discarded very short contents (fewer than three characters) and users with suspiciously high activity (i.e. more than 100 tweets per day, which should exclude most spammers but include prolific real tweeters). Furthermore, only tweets published between 6am and 11pm GMT were considered in order to increase coverage of European postings. These measures reduced the population for sampling to 20,104 tweets.
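The filtering and sampling steps can be illustrated with a minimal Python sketch. It is not the project's actual pipeline: the tweet fields (`user`, `text`, `created_at`), the function names, the per-day counting of a user's tweets within the collected stream, and the treatment of 11pm as an exclusive upper bound are all assumptions made for illustration.

```python
import random
from collections import Counter
from datetime import datetime, timezone

MIN_CHARS = 3             # tweets with fewer characters are discarded
MAX_TWEETS_PER_DAY = 100  # users above this daily count are treated as likely spam
START_HOUR, END_HOUR = 6, 23  # 6am to 11pm GMT; upper bound exclusive (assumption)


def filter_population(tweets):
    """Keep tweets that pass the length, activity and time-of-day filters.

    Each tweet is a dict with hypothetical keys 'user', 'text' and
    'created_at' (a timezone-aware datetime in UTC/GMT).
    """
    tweets = list(tweets)
    # Count tweets per (user, calendar day) within the collected stream.
    per_user_day = Counter((t["user"], t["created_at"].date()) for t in tweets)
    heavy_users = {user for (user, _day), n in per_user_day.items()
                   if n > MAX_TWEETS_PER_DAY}
    return [
        t for t in tweets
        if len(t["text"].strip()) >= MIN_CHARS
        and t["user"] not in heavy_users
        and START_HOUR <= t["created_at"].hour < END_HOUR
    ]


def sample_uniform(population, sample_size=2000, seed=0):
    """Draw a uniform random sample without replacement."""
    rng = random.Random(seed)
    return rng.sample(population, min(sample_size, len(population)))


# Tiny synthetic demo (invented tweets, not taken from the data set).
noon = datetime(2016, 5, 4, 12, 0, tzinfo=timezone.utc)
demo = [
    {"user": "a", "text": "Thinking about the #eureferendum", "created_at": noon},
    {"user": "b", "text": "ok", "created_at": noon},                        # too short
    {"user": "c", "text": "#voteleave", "created_at": noon.replace(hour=3)},  # outside hours
]
print(len(filter_population(demo)))
```

Fixing the seed makes the sample reproducible, which is convenient when the same 2000 tweets have to be presented to several annotators.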

Annotation
The 2000 tweets thus sampled were presented to three annotators, all proficient in English, who created the following annotations for each tweet.
• Strength (only for tweets classified as "stay" or "leave"): an integer between 1 (very weak) and 5 (very strong) expressing the strength of the "stay" or "leave" sentiment
• Contextual dependency: one of the following
  • 0: interpretation of sentiment in tweet does not depend on external sources
  • 1: interpretation of sentiment in tweet depends on external sources (e.g. articles or images that are linked)
The five opinion categories and strength annotations support a fine-grained view on the opinion landscape. Furthermore, the contextual dependency option provides an indication of the difficulty of scoring a tweet, which is a fundamental feature for the construction of opinion mining classifiers.

Agreement and Consolidation
In Table 1 we present the standard inter-rater agreement metrics for each of the annotations.

Table 1: Inter-rater metrics for each annotation type
We achieve moderate Fleiss agreement for sentiment and strength, and fair agreement for context. Average observed agreement gives an indication of the difficulty of this annotation task. The strength assignment is the most difficult, while context dependency is relatively straightforward to determine. Table 2 below shows the distribution of tweets with regard to the number of annotators who agreed on its opinion annotation, providing a different view of agreement. For the contextual dependency annotation, annotators were advised to follow links where necessary for their decision.
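For readers who want to recompute agreement figures of this kind, the following is a minimal sketch of Fleiss' kappa for a fixed number of raters per item. It assumes that the "average observed agreement" corresponds to the mean pairwise agreement per item, which is the observed-agreement term of Fleiss' kappa; the paper does not state exactly how that figure was computed, and the function name and input layout are illustrative.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Return (kappa, mean observed agreement) for categorical ratings.

    ratings: list of items, each a list of labels, one label per rater.
    Every item must have the same number of raters (at least two).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    observed_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Proportion of agreeing rater pairs for this item.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        observed_sum += agreeing / (n_raters * (n_raters - 1))

    p_observed = observed_sum / n_items
    total = n_items * n_raters
    # Chance agreement from the overall label proportions.
    p_expected = sum((c / total) ** 2 for c in category_totals.values())
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one label is ever used")
    kappa = (p_observed - p_expected) / (1.0 - p_expected)
    return kappa, p_observed


# Invented toy ratings by three annotators, for illustration only.
toy = [
    ["leave", "leave", "leave"],
    ["stay", "stay", "leave"],
    ["stay", "stay", "stay"],
]
kappa, observed = fleiss_kappa(toy)
print(round(kappa, 3), round(observed, 3))
```

The same function applies unchanged to the strength and contextual dependency annotations, since it only requires that every item carries one label per annotator.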
We base our consolidation procedure on the three categories of agreement presented in Table 2. For the first two rows (unanimous and two different opinions), we use a) a majority vote for the opinion and contextual dependency annotations and b) the average (rounded to the nearest integer) for the strength annotation. Cases where three different options were selected by the annotators were consolidated manually by a fourth person who had not previously been involved in the annotation. For further information on the resulting data set, see Section 4 below.
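The consolidation rules can be sketched as follows for three annotators. Several details are assumptions rather than statements from the paper: strengths are averaged only over the annotators who chose the majority opinion, ties at .5 are rounded up, and the function name and return format are hypothetical.

```python
import math
from collections import Counter


def consolidate(opinions, contexts, strengths):
    """Consolidate three annotators' labels for one tweet.

    opinions:  three opinion labels, e.g. "stay", "leave", "undecided"
    contexts:  three contextual dependency flags (0 or 1)
    strengths: three strength values (1 to 5), or None where not applicable
    """
    opinion, votes = Counter(opinions).most_common(1)[0]
    # A binary flag rated by three annotators always has a majority.
    context = Counter(contexts).most_common(1)[0][0]

    if votes < 2:
        # Three different opinions: left to a fourth person for manual review.
        return {"opinion": None, "context": context,
                "strength": None, "needs_manual": True}

    strength = None
    if opinion in ("stay", "leave"):
        # Assumption: average only the strengths given with the majority label.
        values = [s for o, s in zip(opinions, strengths)
                  if o == opinion and s is not None]
        if values:
            mean = sum(values) / len(values)
            strength = math.floor(mean + 0.5)  # round half up (assumption)

    return {"opinion": opinion, "context": context,
            "strength": strength, "needs_manual": False}


print(consolidate(["leave", "leave", "stay"], [0, 1, 0], [4, 5, 2]))
```

Rounding half up is used here instead of Python's built-in `round`, which rounds ties to the nearest even integer and would therefore map an average of 4.5 to 4.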

DATA SET DESCRIPTION
The gold standard obtained according to the method described in Section 3 above consists of a total of 2000 tweets. The distribution of sentiment annotations can be seen in Table 3.

Table 3: Distribution of sentiment annotations
The large number of "leave" tweets in our data set reflects the overall tweeting behaviour, as also identified by [7]. Very few tweets display an "undecided" sentiment, in line with observations by [1] that strong opinions predominate on Twitter. Our data display a similar bi-modal distribution if we only consider "stay", "leave" and "undecided". Note that the distribution of sentiment labels in our gold standard does not make any statement about the referendum outcome.
An example of a strong "stay" tweet is given in Example (1a) and a strong "leave" tweet in Example (1b).
(1a) @GeorgeKaburu LOL -how might you think that? Against Brexit is an UNDERSTATEMENT I believe it would be disastrous for the UK and Europe.

(1b) @David_Cameron aren't you listening WE DON'T WANT THESE PEOPLE IN UK #VoteLeave @vote_leave @NoThanksEU @leavehq https://t.co/wNmrFqQNFW
The low percentage of irrelevant tweets shows the usefulness of our tracking keywords in retrieving content which is relevant to Brexit. Many of these irrelevant tweets are in languages other than English. Some hashtags are used ambiguously, such as in Example (2), where "#takecontrol" is used in the context of yoga rather than the "Leave" campaign, which coined the phrase.
(3a) Will the rights to travel, live and work across the #EU change if there's a #Brexit? https://t.co/PJSu1IGQg1 #ukemplaw

(3b) Eagerly anticipating our #EUreferendum debate later today!
Table 4 shows the distribution of opinion strength annotations in our data. We can see that there is a greater tendency for "leave" tweets to display strong opinions, while the "stay" opinions tend towards the weaker end of the scale. Both opinions, however, span the entire continuum.

Table 4: Distribution of strength annotations
Finally, 268 tweets (13.4%) were annotated as depending on context, while 1732 tweets (86.6%) were annotated as not depending on context. Table 5 shows a breakdown of context dependence by opinion annotation. We can see that the percentage of context-dependent tweets is rather stable across all opinion categories.

CONCLUSIONS
From a discourse perspective, the Brexit Twitter Sentiment Gold Standard provides a resource for observing the social and discourse dynamics behind the referendum (in contrast to most political corpora, whose core discourse targets are politicians and parties). The majority of the discourse acts present in the corpus can be categorised into five main classes: event announcement (announcement of political events), linked fact reference (link to larger factoid textual references), direct fact reference (summarised facts within the tweet), reference to political actors (containing opinions about the main political agents behind the opposing views) and informal statements (humorous or hate references).
Despite the fact that there are short dialogues (replies), the corpus does not contain complex instances of argumentation flows.
A limitation of social media-based analysis studies is that they capture only a selective portion of society, since not everyone uses social media. These services are used predominantly by young and politically active people or by individuals with strong political views [1,3]. This may be reflected in the Brexit results, where the majority of the younger generation (age 18-44) voted to remain, as opposed to people over age 45. Such a result falls in line with the latest United Kingdom social media statistics, such as for Twitter, where 72% of the users are between the age of (26% of users).