Privacy preserving sentiment analysis on multiple edge data streams with Apache NiFi

Sentiment analysis, also known as opinion mining, plays a central role in Business Intelligence (BI) in both the private and public sectors, where it is used to improve public and customer experience. Nevertheless, even de-identified sentiment scores derived from public social media posts can compromise individual privacy because they are vulnerable to record linkage attacks. Established privacy-preserving models such as k-anonymity, l-diversity, and t-closeness are offline models designed exclusively for data at rest. Recently, a number of online anonymization algorithms (CASTLE, SKY, SWAF) have been proposed to meet the functional requirements of streaming applications, but none has an open-source implementation. In this paper, we present a reusable Apache NiFi dataflow that buffers tweets from multiple edge devices and performs anonymized sentiment analysis in real time using randomization. The solution can easily be adapted to different scenarios, enabling researchers to deploy custom anonymization algorithms.


I. INTRODUCTION
Online social networks (OSNs) allow users to share their opinions, thoughts, and sentiments, and this wealth of information has attracted the attention of data analysts supporting applications in business, politics, healthcare, and security surveillance. The bulk of current research on social media has traditionally focused on microblogging platforms where social interactions, attitudes, and emotions can be freely monitored and evaluated. However, these platforms offer only coarse controls for users to manage their data. As a result, despite the public nature of users' posts, there is a risk that information collected for one purpose may be used to achieve other, potentially malevolent goals, and that users could be targeted for persecution for their beliefs and opinions. Consequently, despite the data being public in nature, many privacy experts have raised concerns over privacy and security [1,2,3].
While the content of a text message can be analyzed to reveal information such as the age, gender, and political orientation of individuals [4][5][6] or the general mood of people [7,8], image analysis may uncover even more specific information, such as the place where a photo was taken [9]. Network structure can also be exploited to re-identify users even if all other data is anonymized [10]. A user's behavioral signature can be extracted even in the absence of the content of the user's messages. For example, the Twitter API exposes a wealth of metadata, such as the number of times a message was retweeted, liked, or replied to; the account creation date; or the number of friends and followers. This metadata alone can be used to uniquely identify users in a massive dataset [11]. Data owners who do not wish to make their datasets public have various anonymization models at their disposal (k-anonymity, l-diversity, and t-closeness). Furthermore, with the advent of new technological paradigms such as IoT, 5G, and edge computing, anonymization of social data has become a crucial issue. Especially in the case of streaming data, there is an increased need for real-time anonymization.
Although privacy preservation in data publishing has attracted a lot of attention and several models such as k-anonymity and l-diversity have been proposed, most existing work focuses on static data tables and is not easily extendable to streaming data. Recently, a few models have been proposed for streaming data anonymization: SKY [12], CASTLE [13], and SWAF [14]. Yet these models currently lack open-source implementations and cannot easily be adopted by data scientists who lack the expertise to build the infrastructure required to implement them.
In this paper, we tackle the problem of individual and group privacy by anonymizing streaming data such that the anonymized data can still be used to extract useful insights. We develop an experimental setup for anonymization of streaming data using the Apache NiFi platform for real-time data ingestion and processing in a local network of Raspberry Pis [15]. This work was supported by European Commission grants 770469 and 815362. The rest of the paper is organized as follows: Section II outlines closely related work; Section III describes our approach to streaming data anonymization; Section IV details our experimental setup; and Section V presents our conclusions and near-future work.

II. RELATED WORK
Identification risk in publicly released data has been an active area of research for nearly four decades [16][17][18]. A number of anonymization methods exist to conceal the identifying attributes, allowing only quasi-identifiers (QIs) to be available to an adversary. Examples of identifying attributes are a passport number or social security number; other attributes such as zip code, birthdate, or job title are QIs. While no single QI can identify an individual, a combination of them can make identification possible. Furthermore, QIs are also vulnerable to being linked with other, previously acquired datasets to reveal sensitive information. Therefore, most approaches have focused on anonymizing QIs using anonymity operations such as generalization, suppression, anatomization, permutation, and perturbation.
The k-anonymity [16] property requires that, after a series of anonymity operations, each record in a dataset be indistinguishable from at least (k-1) other records with respect to the values of the dataset's quasi-identifiers. k-anonymity has been extensively studied in recent years [17]. However, it is still susceptible to linkage attacks. l-diversity offers an effective defense against the attribute linkage attack by reducing the correlation between QIs and sensitive attributes [18].
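As a concrete illustration of the property (not part of the paper's implementation), k-anonymity can be verified by counting how often each combination of quasi-identifier values occurs in a table; the function and sample records below are hypothetical:

```python
from collections import Counter

def is_k_anonymous(records, qi_indices, k):
    """True if every combination of quasi-identifier values occurs >= k times."""
    counts = Counter(tuple(rec[i] for i in qi_indices) for rec in records)
    return all(c >= k for c in counts.values())

# Hypothetical records: (generalized zip code, birth year, job title, sensitive attribute)
records = [
    ("120**", 1980, "Engineer", "Flu"),
    ("120**", 1980, "Engineer", "Cold"),
    ("121**", 1975, "Teacher", "Flu"),
    ("121**", 1975, "Teacher", "Asthma"),
]
print(is_k_anonymous(records, qi_indices=(0, 1, 2), k=2))  # -> True
print(is_k_anonymous(records, qi_indices=(0, 1, 2), k=3))  # -> False
```

Note that the table above is 2-anonymous but not l-diverse: the first QI group would still leak a respiratory illness if both its sensitive values were identical.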
The most common implementations of k-anonymity apply the generalization and suppression operations. Generalization replaces a targeted quasi-identifier value with a more general one. Suppression replaces a QI value with a special symbol (e.g., *, &, #), rendering the value meaningless. Other operations, such as anatomization and permutation, decrease the correlation between attributes without modifying the original values. The goal of perturbation, on the other hand, is to substitute original values with synthetic data while keeping the statistical characteristics of the dataset unchanged. Due to this property, perturbation (or randomization) is the most suitable operation for accurate data analytics on an anonymized dataset.
In generalization, the values of the identifying attributes are replaced by category indices; in randomization, they are perturbed by adding noise. For example:

Generalization of Age: 21 => [20 - 24]
Randomization of Age: 21 => 23

However, the methods available for streaming data anonymization [12][13][14] use generalization, which is not useful for accurate sentiment analysis. In this paper, we propose a method that buffers the data in batches and then anonymizes it by randomization using a kernel density estimator.
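The two operations on the age example can be sketched in a few lines of Python; the bucket width and noise spread below are illustrative choices, not parameters from the paper:

```python
import random

def generalize_age(age, width=5):
    # Replace an exact age with its bucket, e.g. 21 -> "[20-24]"
    lo = (age // width) * width
    return f"[{lo}-{lo + width - 1}]"

def randomize_age(age, spread=2):
    # Replace an exact age with a noisy value drawn near it
    return age + random.randint(-spread, spread)

print(generalize_age(21))  # -> "[20-24]"
print(randomize_age(21))   # some value in 19..23
```

Generalization keeps the output deterministic but coarse; randomization keeps the output at the original granularity, which is why it better supports numeric analytics such as averaging sentiment scores.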

III. OUR APPROACH
To overcome the identification risk posed by publicly streamed data, we develop an online, real-time anonymization method. Consider, as an example, social media message streams arriving at a server for analysis. Even if the source of each stream removes the identifying attributes and the content of the posts, the remaining metadata and additional domain knowledge (e.g., that the streams contain messages from a particular location) may be sufficient to breach the privacy of the users.
In a more realistic scenario, consider the application that builds the public sentiment profile for a global event by analyzing social media messages from designated locations. It is important for the application to know the overall sentiment from each of these locations, but to hide the individuals' shared sentiments about the event. The data collected by the system requires anonymization such that while the anonymized data allows for aggregate evaluations (e.g., sentiment scores), it disallows identification / tracking of individual users.
Essentially, the randomization method works by reconstructing the probability density function from a set of given data points. We tested two density estimation methods: histograms [19] and kernel density estimation [20]. Histograms are a discrete representation of numeric data in equal-sized bins; constructing one requires choosing the bin width and the start and end points of the bins. As a result, histograms are not smooth and depend heavily on the bin size and bin boundaries. We overcome these problems by using kernel density estimators (KDEs), which, unlike histograms, produce smooth, non-parametric density estimates.
More formally, kernel density estimation (KDE) smooths each data point into a region whose shape is determined by the kernel function K(x), and sums over all these regions to obtain a density estimate. Let X_1, \dots, X_n \in \mathbb{R} be an independent, identically distributed univariate random sample from an unknown distribution F with density function f. Then the KDE can be expressed as

\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left( \frac{x - X_i}{h} \right),

where K is the kernel function and h > 0 is the smoothing bandwidth that controls the amount of smoothing. The condition \int K(u)\,du = 1 is imposed to ensure that \hat{f}_h(x) integrates to 1. The kernel function K is a non-negative smooth function with a peak at 0; the most common choice of multivariate kernel function is the Gaussian. In order to apply KDE to estimate the underlying probability density and to generate random samples according to it, we need a multivariate kernel density estimator. Furthermore, for the data to remain useful for analytics after randomization, we need to estimate a conditional multivariate density function and use it to sample the anonymized dataset. Specifically, consider variables x, y, z, and w, where w is the independent variable and x, y, and z are dependent variables; say w is a location value and x, y, z are demographic variables. To anonymize a dataset comprising these four variables, we randomize the values such that the statistical distributions remain identical before and after randomization. This can be achieved by estimating the conditional distribution

f(x, y, z \mid w) = f(x \mid w)\, f(y \mid w, x)\, f(z \mid w, x, y),

where f is the probability density function. However, since the multivariate case involves correlated random variables with non-normal distributions, we sample via the forward Rosenblatt transformation

u_1 = F(x \mid w), \quad u_2 = F(y \mid w, x), \quad u_3 = F(z \mid w, x, y),

where each F is a (conditional) cumulative distribution function.
Next, from the available input data, we use the multivariate normal kernel to estimate f(x, y, z, w), using maximum-likelihood bandwidth selection. We then sample this distribution to generate the anonymized dataset. The method was implemented in Python 3 using the scipy.stats, statsmodels.api, and chaospy libraries [20].
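As a minimal sketch of the randomization step (fitting a joint density and resampling it, without the full conditional Rosenblatt pipeline described above), a Gaussian KDE can be fitted to the joint sample and then resampled to produce synthetic records whose overall statistics track the originals. The variable names, distributions, and sizes below are illustrative assumptions:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
n = 1000
# Synthetic stand-in for (compound sentiment score, demographic value) pairs
original = np.vstack([
    rng.normal(0.2, 0.1, n),    # hypothetical sentiment scores
    rng.normal(35.0, 8.0, n),   # hypothetical ages
])

kde = gaussian_kde(original)          # multivariate Gaussian kernel
anonymized = kde.resample(n, seed=1)  # draw synthetic records from the estimate

# Per-variable means are approximately preserved under resampling
print(np.abs(original.mean(axis=1) - anonymized.mean(axis=1)))
```

The resampled records are synthetic, so no row corresponds to a real individual, yet aggregate statistics (means, correlations) remain close to the originals, which is the property the analytics application needs.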

IV. EXPERIMENTS AND RESULTS
The experiment aims to simulate a real-life scenario in which a business deploys smart speakers in different reception areas. Each device captures interactions between customer-service staff and customers and determines the customer's overall sentiment score using speech-recognition software. The challenge is to push actionable sentiment analytics to the cloud, enabling Business Intelligence (BI) operations, while ensuring that the data on the edge devices are anonymized and the identities of the customer and the employee protected.
To make the scenario more operational, we decided to collect public tweets from three different locations in the UK, using hashtags that reflect users' political affiliation. The intuition is that an adversary can compromise individual and group privacy by tracking support or condemnation of Brexit. This is a realistic scenario, applicable also to smart city platforms, in which an attacker gets hold of sentiment scores and tracks users who are easily re-identified via a linkage attack.

A. System Architecture
The system architecture resembles an IoT setup in which three smart devices capture textual data and perform localized sentiment analysis and anonymization in real time before pushing data to a BI platform. Our experimental setup consists of one regional node and three edge nodes communicating over HTTP on a local area network. The regional node, running a NiFi server and Elasticsearch, receives data from the three edge nodes, each running a MiNiFi instance. A Mac mini hosts the regional NiFi and Elasticsearch servers. Each MiNiFi dataflow collects streaming tweets from one of three regions in the UK using the Filter Endpoint and pushes the data over HTTP to the regional NiFi installation, where the data is stored in Elasticsearch.

B. Apache NiFi Dataflow
Using NiFi processors, we fetched live streams from the source system (Twitter) into IoT devices deployed at three different locations in our lab (Figure 1). Specifically, we ran the MiNiFi dataflows for nine hours and collected 1,758 Brexit-related tweets from 662 unique locations within three predefined bounding boxes of coordinates in the UK. Four sentiment scores (positive, negative, neutral, and compound) were calculated using the NLTK VADER algorithm [19]. The compound sentiment scores computed on the three edge devices were distributed as follows: 578 tweets had negative polarity, 700 tweets had positive polarity, and for 480 tweets the compound polarity summed to zero, indicating the diversity of opinions captured in our sample.
We used the GetTwitter processor to fetch streaming tweets based on location and keyword criteria. Each incoming flowfile (tweet) is first filtered and run through the VADER sentiment intensity analyzer and the anonymization modules before the anonymized scores and location of the tweet are ported to the regional node for ingestion into Elasticsearch. VADER is a lexicon- and rule-based sentiment analyzer specifically designed for social media text. Each word in its lexicon is assigned a sentiment value indicating the intensity of its positive or negative polarity. The scores 'pos', 'neg', and 'neu' are computed as a normalized sum of the values of the words present in the text.
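For illustration, the normalized-sum idea behind these scores can be sketched with a toy lexicon. The words and weights below are made up; the real VADER lexicon and its additional rules (negation, intensifiers, punctuation) ship with NLTK's SentimentIntensityAnalyzer:

```python
# Toy lexicon: word -> polarity intensity (illustrative values, not VADER's)
LEXICON = {"good": 1.9, "great": 3.1, "bad": -2.5, "awful": -3.3}

def toy_scores(text):
    """Return 'pos'/'neg'/'neu' scores as a normalized sum of word values."""
    values = [LEXICON.get(w, 0.0) for w in text.lower().split()]
    pos = sum(v for v in values if v > 0)
    neg = -sum(v for v in values if v < 0)
    neu = float(sum(1 for v in values if v == 0.0))
    total = pos + neg + neu
    return {"pos": pos / total, "neg": neg / total, "neu": neu / total}

print(toy_scores("great service but awful wait"))
```

By construction, the three scores are non-negative and sum to 1, mirroring the normalization VADER applies to its 'pos', 'neg', and 'neu' outputs.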
A list of keywords was selected to capture tweets related to Brexit. The three locations (bounding boxes) were selected based on the results of the 2016 UK European Union membership referendum, to reflect a broad spectrum of opinions between supporters of the leave and remain camps. A separate GetTwitter processor fetched, from the Sample Endpoint, notices of tweets that had been deleted by their authors. Tweets removed or deleted by users were then removed from Elasticsearch using a "delete_by_query" POST request issued via a NiFi ExecuteStreamCommand processor. The algorithmic representation of the MiNiFi dataflow that runs locally on each device, along with the reusable XML template, is available at [19].

C. Anonymization
Because of space constraints, Table I shows only a sample of the original data ported into the regional node from the edge nodes. As described earlier, the edge nodes calculate the sentiment scores of each tweet, removing any identifying attributes of the tweets' authors. However, despite these omissions, a re-identification attack can still put individuals at risk of being tracked, because an attacker with background knowledge about the system can reproduce the sentiment scores. We overcome this problem by randomizing the sentiment-score values using our multivariate conditional kernel density estimator. The results for the chosen sample are shown in Table II.
To quantify the information loss, we measured the mean change in sentiment values per location after anonymization. To compute this, we use the conditional cross entropy (conditional on the Location attribute) of the sentiment scores. Measuring the unconditional entropy (or KL divergence) of the multivariate data before and after anonymization is unsuitable, since we want the conditional distributions preserved; this is what the target analytics application requires, e.g., summarizing the overall sentiment about an event per location. We calculate the information loss H between the distributions before and after anonymization, conditioned on the location variable, as

H = -\sum_{x} p(x \mid L)\, \log q(x \mid L),

evaluated empirically over the N tweets and averaged across locations, where p(x|L) and q(x|L) are the conditional distributions of variable X given location L before and after anonymization, X is the positive, negative, neutral, or compound sentiment score of a tweet, and N is the total number of tweets. Table III shows the mean squared error and the conditional cross entropy, averaged across all locations, for (a) the naïve method and (b) our method. The table shows that both the average conditional cross-entropy loss and the MSE are lower with our anonymization method than with the popular histogram method. This suggests that our method incurs little information loss even after randomization of the data, rendering it useful for accurate data analytics aiming for more reliable results.

V. CONCLUSION AND FUTURE WORK
In this paper, we propose a solution that strengthens individual and group privacy in streaming data analytics using the Apache NiFi platform. Our multivariate kernel density estimator results in low average information loss compared with the baseline, rendering it useful for accurate data analytics aiming for reliable results. While our experimental setup currently targets Twitter, our system architecture could potentially support other applications, such as voice data and video surveillance.
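The per-location cross-entropy computation can be sketched as follows; the function name and the histogram inputs are illustrative assumptions, since the paper does not publish its code for this step:

```python
import numpy as np

def conditional_cross_entropy(p_by_loc, q_by_loc, eps=1e-12):
    """Cross entropy H(p, q) per location, averaged across locations.

    p_by_loc / q_by_loc map a location to a normalized histogram of a
    sentiment score before / after anonymization (hypothetical format).
    """
    losses = []
    for loc, p in p_by_loc.items():
        q = np.clip(q_by_loc[loc], eps, None)  # guard against log(0)
        losses.append(-np.sum(p * np.log(q)))
    return float(np.mean(losses))

# Sanity check: identical distributions reduce cross entropy to the entropy of p
p = {"LocationA": np.array([0.5, 0.5])}
print(conditional_cross_entropy(p, p))  # -> ln(2) ≈ 0.693
```

A low value relative to the entropy of the original distribution indicates that the anonymized per-location distributions stay close to the originals, which is the usefulness criterion the paper targets.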