Political Tweet Sentiment Analysis for Public Opinion Polling

Public opinion measurement through polling is a classical political analysis task, e.g. for predicting national and local election results. However, polls are expensive to run and their results may be biased, primarily due to improper population sampling. In this paper, we propose two methods for employing tweet sentiment analysis results for public opinion polling. Our first method relies solely on tweet sentiment analysis results and outperforms multiple well-recognised methods. In addition, we introduce a novel hybrid way to estimate electoral results from both public opinion polls and tweets. This method enables more accurate, frequent, and inexpensive public opinion estimation and is used for estimating the result of the 2023 Greek national election. Our method demonstrated lower deviation from the actual election results than the conventional public opinion polls, introducing new possibilities for public opinion estimation using social media platforms.


INTRODUCTION
Public opinion consists of concepts, ideas, and statements that seem too abstract to quantify. However, the popularity of certain political entities, such as political parties and politicians, seems much more easily quantifiable. Let us consider the case of $n$ (political) entities, each having an unknown popularity score $p_i$, $i = 1, \dots, n$. As political voting is a competitive procedure, the popularity score (essentially the voting intention) represents the percentage of people that would prefer entity (political party or candidate) $i$ over all other entities. The popularity scores $p_i$, $i = 1, \dots, n$ are initially unknown. They can be estimated in various ways, e.g. through conventional population sampling and polling (through questioning) or by social media data analysis. Any polling method leads to a popularity score estimate $\hat{p} = [\hat{p}_1, \dots, \hat{p}_n]^T$ that should be as close as possible to the unknown popularity scores $p = [p_1, \dots, p_n]^T$, which become known only infrequently, on special occasions, e.g. through an election/voting procedure.

Let $P$ be the total population set and $P_m$, $P_o$ be the population subsets of a) people that are politically active in social media and b) people participating in a public opinion poll. Each member of the set $P_m$ produces political texts that, in many cases, refer to the political entities $1, \dots, n$ in question. Social media hashtags can be used to associate a text with a political entity. Text sentiment analysis can be used to classify such texts into the sentiment classes 'positive', 'neutral', and 'negative' and to quantify the respective text (e.g. tweet) numbers $a_i$, $b_i$, $c_i$ for each political entity $i = 1, \dots, n$. The data analysis problem at hand is to regress $\hat{p}_m$ from the sentiment dataset $S = \{(\hat{a}_i, \hat{b}_i, \hat{c}_i),\ i = 1, \dots, n\}$. As the sets $P$, $P_m$, $P_o$ differ, the respective estimates $\hat{p}$, $\hat{p}_m$, $\hat{p}_o$ will be different. Since voting results are infrequent and traditional polling results are more frequent, we can use $\hat{p}_o$ and past measurements of $p$ to estimate the popularity scores from $S$, without performing new traditional opinion polls.
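As a minimal illustration of how the counts $a_i$, $b_i$, $c_i$ could be tallied, the sketch below attributes each sentiment-labeled tweet to the parties whose hashtags it contains. The hashtag map, the sample tweets, and the function name are hypothetical and serve only to make the bookkeeping concrete; they are not the pipeline used in this paper.

```python
# Hypothetical hashtag-to-party-index map (illustration only).
HASHTAG_TO_PARTY = {"#party_a": 0, "#party_b": 1}


def count_sentiments(tweets, n_parties):
    """Return per-party counts (a, b, c) of positive, neutral, negative tweets."""
    a = [0] * n_parties
    b = [0] * n_parties
    c = [0] * n_parties
    target = {"positive": a, "neutral": b, "negative": c}
    for text, label in tweets:
        # A tweet is attributed to every party whose hashtag it contains.
        parties = {
            HASHTAG_TO_PARTY[word]
            for word in text.lower().split()
            if word in HASHTAG_TO_PARTY
        }
        for i in parties:
            target[label][i] += 1
    return a, b, c


# Toy, made-up tweets with sentiment labels already assigned by a classifier.
tweets = [
    ("Great rally today #party_a", "positive"),
    ("Broken promises again #party_a", "negative"),
    ("Debate tonight #party_a #party_b", "neutral"),
    ("Terrible policy #party_b", "negative"),
]
a, b, c = count_sentiments(tweets, n_parties=2)
```

In this toy run the neutral tweet mentions both parties, so it is counted once for each of them.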
Conventional polling using set $P_o$ has been satisfactorily accurate over the years, achieving rather low polling errors. However, the process of choosing the sample set $P_o$ correctly and manually asking political questions has proved to be costly. To this end, social media sentiment analysis can provide a cheaper and faster alternative for estimating the popularity score distribution $\hat{p}_m$. As more and more people use social media to publicly state their preferences on political topics, social media polling provides an opportunity for cheaper and real-time estimation of the popularity score distribution. In this paper, we propose two solutions for estimating $\hat{p}_m$, which can be used to estimate public opinion in terms of voting intention.
In the last decade, Twitter and other social media platforms have been widely used as political communication platforms. This urged the scientific community to investigate the idea of estimating public opinion and predicting election results using only the data posted online (tweets in the case of Twitter). This trend started on a large scale with the US presidential elections of 2016 and showed very promising results [1]. The same methods were implemented for other two-party ($n = 2$) political systems, performing equally well [2,3]. Method [2] tried to improve the popularity metric proposed in [4] for predicting the results of the final round of the 2017 French presidential election. However, the extension of these methods to multi-party ($n > 2$) elections is not straightforward. Many approaches were used to bridge the gap between two-party and multi-party elections, with controversial results. Sentiment score was proposed in [5] as the ratio of positive and negative messages on a topic. This method has been widely used for predicting two-party or multi-party election results. The mapping of the actual political landscape for the 2010 UK general election has been studied in [6]. This study explicitly concluded that political party popularity cannot be predicted solely using Twitter data. Similar methods have been implemented in [7]. The fact that most methods attempting to predict general elections through Twitter produced poor results led to a hybrid prediction system [8], implementing an election result regression model whose input comprises several popularity score metrics. This system was trained on conventional opinion poll results by applying the methods described in [9].
A serious research issue in social media polling is data sentiment imbalance. The problem with political comments on social media is that only a few people, who are probably biased, comment positively about a party. Our analysis shows that only 7% of the tweets gathered are positive, which significantly increases the difficulty of extracting correct election predictions. Therefore, we propose a novel heuristic election result estimator based primarily on negative tweets. In addition, we propose a novel method for regressing the popularity score distribution using past traditional polls and election results. Our heuristic method exhibited the lowest deviation from the general election results, compared to other heuristic methods. Furthermore, our hybrid method outperformed not only the existing hybrid methods in the literature but also the last public opinion poll of all major Greek polling companies before the general elections.

POLITICAL POPULARITY SCORE ESTIMATION BASED ON TWEET SENTIMENT ANALYSIS

Heuristic popularity estimation
Without loss of generality, the rest of this paper's estimates refer to the popularity scores $p$, either using a population from set $P_m$ or from $P_m$ and $P_o$. The popularity scores $p_i$, $i = 1, \dots, n$ can be heuristically estimated from sentiment-labeled political tweets as follows. Firstly, we perform tweet sentiment analysis [10] and automatically tag each political tweet corresponding to a political party (as identified by the tweet hashtags) as positive, neutral, or negative. Let $a$, $b$, $c$ be $n$-dimensional vectors, where $a_i$, $b_i$, $c_i$ represent the total number of positive, neutral, and negative tweets for a political entity (party) $i = 1, \dots, n$. Let vector $d = a + b$ represent the sum of positive and neutral tweet numbers. The heuristic popularity score $\hat{p}_i(c, d)$ is formulated from the vectors $c$ and $d$ and from their totals over all parties, $c_\tau$ and $d_\tau$.

Political popularity score regression from tweet sentiment analysis and past opinion polls

Opinion Poll Trends Regressor (OPTR)
Over the years, heuristic popularity score predictions have shown promising results. However, their prediction accuracy is sub-optimal, as neither ground truth nor optimization criteria have been used in their derivation. Therefore, they did not take advantage of the recent success of Machine Learning methods. Given this fact, we can address this accuracy loss by resorting to past conventional public opinion poll results and using them as ground truth data. To this end, we developed a regression model that correlates changes in public opinion poll results to changes in the positive, neutral, and negative counts $a_{jt}$, $b_{jt}$, $c_{jt}$, $j = 1, \dots, L$, $t = 1, \dots, \Lambda$, for a certain time window before two consecutive opinion polls. Index $j$ indicates the chronological order of the opinion poll and $L$ is the total number of recorded conventional public opinion polls. Essentially, our method is based on the observation that a significant change in the data measured on social media should result in a corresponding change in the popularity scores of the entities. We define the input of our model, $\tilde{a}$, $\tilde{b}$, $\tilde{c}$, as the difference of the average positive, neutral, and negative counts over a specified time window $\Lambda$ before two consecutive opinion polls. By using this input on a simple regression model, we implement this mapping and estimate the change $r$ in popularity scores, $r = f(x; w)$ (6), where $x = \tilde{a} \,\|\, \tilde{b} \,\|\, \tilde{c}$ is the input vector of size $3n$ and $f$ is the model regression function, parameterized by $w$. Variables $k$ and $l$ indicate two consecutive conventional polls. The training procedure is conducted using the Mean Squared Error (MSE) loss.
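The construction of the regressor input can be sketched as follows. This is an illustrative reading of the description above, assuming a plain arithmetic mean over the days of each window and the sign convention "poll $k$ minus poll $l$"; the function names and the toy numbers are hypothetical.

```python
def window_mean(daily_counts):
    """Average per-party counts over a window of days.

    daily_counts: list of days, each a list of n per-party counts.
    """
    days = len(daily_counts)
    n = len(daily_counts[0])
    return [sum(day[i] for day in daily_counts) / days for i in range(n)]


def optr_input(window_k, window_l):
    """Build the 3n-dimensional input as the concatenation of the three differences.

    Each window is a dict with keys "a", "b", "c" holding the daily positive,
    neutral and negative counts for the days preceding that poll.
    """
    x = []
    for key in ("a", "b", "c"):
        mean_k = window_mean(window_k[key])
        mean_l = window_mean(window_l[key])
        # Assumed sign convention: window of poll k minus window of poll l.
        x.extend(mk - ml for mk, ml in zip(mean_k, mean_l))
    return x


# Toy data: n = 2 parties and a window of 2 days before each poll.
window_l = {"a": [[2, 4], [4, 6]], "b": [[10, 10], [10, 10]], "c": [[8, 2], [6, 4]]}
window_k = {"a": [[4, 4], [6, 6]], "b": [[10, 12], [10, 12]], "c": [[9, 3], [9, 3]]}
x = optr_input(window_k, window_l)
```

Any regressor trained with an MSE loss could then be fitted on pairs of such inputs and the corresponding poll differences.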
Once the regression model (6) is trained on sample data $D = \{\hat{p}_{ok} - \hat{p}_{ol},\ x_{kl}\}$, $k, l = 1, \dots, L$, it can produce the next popularity score estimate $\hat{p}$ by adding $r$ to the previous measurement (7), where $t$ indicates the specific point in time being calculated (usually in days). Variable $k$ can be either $k = l + 1$ or $k = l - 1$ for data augmentation reasons, assuming that the opposite change in the social media statistics would bring the exact opposite change to the conventional polls' results.
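The data augmentation and the additive update described above can be sketched as follows. This is a minimal sketch under stated assumptions: the regressor itself is left out, the predicted change is supplied by hand, and the function names and numbers are hypothetical.

```python
def augment(samples):
    """Add the mirrored pair (l, k) for every consecutive pair (k, l).

    samples: list of (x_kl, r_kl), where r_kl is the poll difference for the
    pair. The mirrored sample negates both the input and the target, following
    the assumption that an opposite change in the social media statistics
    brings the opposite change in the polls.
    """
    mirrored = [([-v for v in x], [-v for v in r]) for x, r in samples]
    return samples + mirrored


def next_estimate(p_prev, r):
    """Add the predicted change r to the previous popularity estimate."""
    return [p + dr for p, dr in zip(p_prev, r)]


# Toy sample for n = 2 parties (input truncated to 2 features for brevity).
samples = [([2.0, 0.0], [0.01, -0.01])]
training_set = augment(samples)

# Starting from a previous estimate, apply a (hand-supplied) predicted change.
p_next = next_estimate([0.40, 0.60], [0.01, -0.01])
```

Because the two entries of the predicted change cancel out in this toy case, the updated scores still sum to 1; nothing in the sketch enforces that in general.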
To counter the bias introduced by conventional public opinion polls, instead of using their results $\hat{p}_{ok}$ directly, we choose to utilize the difference of two consecutive opinion polls, $\hat{p}_{ok} - \hat{p}_{ol}$, $k = l \pm 1$. OPTR is similar to the method proposed in [8]. However, the latter's regression input consists of features produced by heuristic estimators and its output utilizes the actual opinion poll estimates, so it differs significantly from our method. As a result, that method fails to treat the components of the political system (parties) as dependent entities and does not filter the bias added by opinion polls. In the proposed method, new estimates are given according to the previous ones, as (7) indicates. This creates the need for initial values, which can be collected from actual election results because of the bias introduced by conventional public opinion polls.

Opinion poll grouping to be used in the regression model
The above-mentioned political popularity score estimation can be sensitive to the unavoidable variations observed between conventional public opinion estimates conducted by different companies. The chosen approach is to group different polls that were conducted in the same period, according to their deviation from the estimates of other polling companies, to be used in the regression model (6). Let us suppose that $u_{zi}(t_k)$ is the estimate of the popularity score of party $i$ in a poll conducted by company $z = 1, \dots, m$ at date $t_k$. As the poll dates differ across polling companies, we perform linear interpolation for each political entity between two consecutive polls conducted by the same company for a given date $t$, where $t_k \leq t \leq t_{k+1}$, by using the formula $u_{zi}(t) = u_{zi}(t_k) + \frac{t - t_k}{t_{k+1} - t_k} \left[ u_{zi}(t_{k+1}) - u_{zi}(t_k) \right]$. Then we can compute the sum of Mean Absolute Errors (MAE) $e_\zeta(t)$ between the polls of company $\zeta$ and the polls of the other polling companies $z = 1, \dots, m$ on date $t$ (9). The variations between public opinion estimates by different companies in the same period also cause problems for our regressor. To this end, when two or more public opinion polls were held less than $d$ days apart, they were merged using a weighted average according to their respective errors. The hyperparameter $d$ is set depending on the specifications of the opinion polling problem.
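The grouping steps above can be sketched as follows. This is a sketch under stated assumptions, not the exact procedure of this paper: each pairwise MAE is averaged over the $n$ parties, and polls are merged with inverse-error weights, which is one plausible reading of "weighted average according to their respective errors". The company names and poll shares are made up.

```python
def interpolate(t, t_k, u_k, t_next, u_next):
    """Linearly interpolate per-party poll shares between two poll dates."""
    w = (t - t_k) / (t_next - t_k)
    return [uk + w * (un - uk) for uk, un in zip(u_k, u_next)]


def company_error(zeta, estimates):
    """Sum, over the other companies, of the MAE against company zeta.

    estimates: dict mapping company -> per-party shares at the same date t.
    """
    ref = estimates[zeta]
    total = 0.0
    for z, shares in estimates.items():
        if z == zeta:
            continue
        total += sum(abs(r - s) for r, s in zip(ref, shares)) / len(ref)
    return total


def merge_polls(polls, errors):
    """Weighted average of nearby polls (inverse-error weights: an assumption)."""
    weights = [1.0 / (e + 1e-9) for e in errors]
    norm = sum(weights)
    n = len(polls[0])
    return [sum(w * p[i] for w, p in zip(weights, polls)) / norm for i in range(n)]


# Toy example: hypothetical companies "A", "B", "C" and n = 2 parties.
estimates = {
    "A": interpolate(5, 0, [0.40, 0.60], 10, [0.50, 0.50]),  # between two polls
    "B": [0.50, 0.50],
    "C": [0.40, 0.60],
}
errors = {z: company_error(z, estimates) for z in estimates}
merged = merge_polls(
    [estimates["A"], estimates["B"]], [errors["A"], errors["B"]]
)
```

With these weights, the poll that deviates less from the other companies contributes more to the merged estimate.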

Data Gathering
A total of 1,001,836 tweets have been gathered about six Greek parliamentary political parties, using the Twitter API, from 25th June 2022 until 25th June 2023. All tweets have been labeled as neutral, positive, or negative using the Transformer method proposed in [11], which exhibits 79% sentiment recognition accuracy, tested on ground truth Greek political tweets [12]. During the data gathering period, we managed to collect 35 public opinion polls, which we utilized for training our regression model (6) and also for validating our proposed techniques. The Greek general elections were held on 21/5/2023 and 25/6/2023, according to the provisions of the Greek constitution.

Comparison of heuristic political popularity score estimators
We evaluate and compare our estimator (PPSE) with five different heuristic estimators [2,4,5,8,13], which were either proposed as estimators or used as features in regression models, on Greek political Twitter data. To this end, we calculated the popularity score, according to the aforementioned estimators, for every Greek parliamentary party during the data collection period. Then, to compare the different heuristic estimator outputs, we calculated the Mean Absolute Error (MAE), defined as the average absolute deviation between each estimator's output and the general election results (used as ground truth). As some estimators do not sum to 1 over all $n$ entities, we normalised them first: $\acute{p}_i = \hat{p}_i / \sum_{j=1}^{n} \hat{p}_j$. The comparison covers time windows of 100 to 300 days backward from each of the two election dates. Our estimator outperforms the other estimators on most of the windows tested. It must be noted here that, according to the election results, our heuristic estimator was the only one to correctly predict the actual party ranking (ND > SYRIZA > KINAL > KKE > ELLINIKI LISI > MERA25). However, all the estimators struggle to predict the actual vote shares. This might occur because of the advantage polls have over Twitter of picking a balanced sample of different groups in society. Hence, for all practical purposes, it is best to use the proposed OPTR method, which provides superior election result prediction, as analyzed in the next section.
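The normalisation and the MAE comparison described above can be sketched as follows. The raw scores and the "election" shares are invented numbers for three hypothetical parties, used only to show the computation; the ranking check mirrors the party-order comparison mentioned in the text.

```python
def normalise(scores):
    """Rescale estimator outputs so that they sum to 1 over all parties."""
    total = sum(scores)
    return [s / total for s in scores]


def mean_absolute_error(estimate, truth):
    """Average absolute deviation between estimated and actual vote shares."""
    return sum(abs(e - t) for e, t in zip(estimate, truth)) / len(truth)


def ranking(scores):
    """Party indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


raw = [3.0, 2.0, 1.0]    # hypothetical unnormalised estimator output
truth = [0.5, 0.3, 0.2]  # hypothetical election shares

estimate = normalise(raw)
mae = mean_absolute_error(estimate, truth)
same_order = ranking(estimate) == ranking(truth)
```

An estimator can reproduce the party ranking exactly, as in this toy case, while its MAE against the vote shares remains non-zero.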

OPTR model evaluation
Since the data collection started in June 2022, we could not access either the poll results of the previous election or the tweets of the respective periods. Fortunately, general elections were held twice, allowing us to test our method. Essentially, our method used the results of the election of 21/5/2023 as initial values and calculated the popularity score changes until the second election of 25/6/2023. Our method is compared with different polling companies and with the results of method [8]. Table 2 presents the MAE from the actual election results of 25/6/2023 for our proposed method, the method of [8], and the last recorded opinion poll of each company before the election date. As seen, the opinion poll regressor (OPTR) outperforms the technique proposed in [8] and all the conventional opinion polls held in the last two weeks before the election date. OPTR was trained only with the noisy opinion polls before the first election date (21/5/2023), which were biased, as shown by the election results. Although our technique learned the political trends from those noisy samples, their combination with the first election's result surpasses all other techniques and opinion polls, without using any of the polls published after 21/5/2023. Figure 1 presents the error of each company from the others until the first election date, as calculated from (9). This error agrees with Table 2, which is an additional validation for using the deviation between different polling companies to eliminate unwanted variations in the training data.

CONCLUSION
In this paper, we proposed two new methods for estimating political popularity scores through sentiment analysis of social media data: a heuristic method and a regression method. Both provide reasonably good estimates of political popularity scores. The regression-based method is more accurate than the heuristic one, but it also requires past opinion poll data and past election results. The difference between heuristic popularity estimators and hybrid ones (using social media and past conventional polls) is still considerable. As Natural Language Processing (NLP) tools become more advanced, political forecasting through social media alone should become more accurate, but for the time being hybrid regression techniques outperform it. As indicated by our experiments, a hybrid method using both Twitter data and opinion polls provided better results than conventional opinion polling companies. This paper proposes an approach to political analysis that leverages social media as a primary data source, and the results of our study on the 2023 Greek elections show that this approach can outperform traditional public opinion polls.

ACKNOWLEDGEMENT
The research leading to these results has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement No 951911 (AI4Media). This publication reflects only the authors' views. The European Commission is not responsible for any use that may be made of the information it contains.
In the heuristic popularity score, $d_\tau = \sum_{i=1}^{n} d_i$ and $c_\tau = \sum_{i=1}^{n} c_i$. Essentially, $\hat{p}_i(c, d)$ distributes the total negative tweet count (without the ones of party $i$) according to each party's own positive and neutral comment counts. As the popularity score distribution should satisfy $\sum_{i=1}^{n} \hat{p}_i(c, d) = 1$, we modify this heuristic estimator accordingly and introduce the Political Popularity Score Estimator (PPSE).

Fig. 1. Poll error (e), from other public opinion polls, throughout the number of days since the 14th of June. Dots indicate the last day when each poll was conducted.

Table 1 presents the testing results during 3 periods, starting from 21 May 2023 and 25 June 2023, for a time window of 100 to 300 days backward.