User Identity Linkage in Social Media Using Linguistic and Social Interaction Features

Social media users often hold several accounts in their effort to multiply the spread of their thoughts, ideas, and viewpoints. In the particular case of objectionable content, users tend to create multiple accounts to bypass the countermeasures enforced by social media platforms and thus retain their online identity even if some of their accounts are suspended. User identity linkage aims to reveal social media accounts likely to belong to the same natural person so as to prevent the spread of abusive/illegal activities. To this end, this work proposes a machine learning-based detection model, which uses multiple attributes of users' online activity in order to identify whether two or more virtual identities belong to the same real natural person. The model's efficacy is demonstrated in two case studies on abusive and terrorism-related Twitter content.


INTRODUCTION
In its somewhat more than 20 years of existence, social media has become an integral part of the life of more than 2.6 billion people around the globe. Originally envisaged as a means to stay connected with friends, get informed, or be entertained, it has become a very powerful instrument for public opinion formation and dissemination of all kinds of not always harmless content. Particularly worrying is the spread of abusive, extremist, and terrorism-related content via widely used online social platforms, such as Twitter and Facebook. In order to address this problem, social media administrators implement filtering methods and suspend accounts once harmful content is detected [41].
However, to counter such measures and overcome the suspension policies, users seeking to widely disseminate deleterious material often follow various strategies, the most popular being the setting up of multiple (back-up) accounts that allow them to keep contact with individuals with the same disposition (e.g., violent extremists) and exchange content, even after one of their accounts gets suspended [7,18]. It is thus of paramount importance to be able to detect user accounts (alias user identities) likely to belong to the same person, so as to stop the propagation of harmful behavior on a large scale, including the spread of abusive or terrorism-related material. User identity linkage (i.e., detection of multiple user identities) has been studied both across social networks (e.g., [21,34]) and within the same social network (e.g., [15,40]). This paper focuses on the latter case and, particularly, on Twitter. Twitter has been selected as it is one of the most popular social media platforms and often contains abusive [2,4] or terrorism-related [6,10] material. Moreover, Twitter is a rather challenging platform for investigating this phenomenon, since tweets are short and often contain grammatical and orthographic errors, thus making it harder to use off-the-shelf natural language processing tools to analyze them in the context of such investigations. As a consequence, Twitter is often avoided as a single social media source for the study of user identity linkage. Furthermore, user identity linkage research has thus far been mainly conducted on English data sources. Since the dissemination of deleterious (e.g., abusive and terrorism-related) material is not limited to English, the consideration of other languages is also necessary.
Overview & Contributions. In this paper, we design, implement, and evaluate a methodology geared to identify the linkage between online user accounts within the same social network. Specifically, this work proposes a framework which considers a wide range of profile, linguistic, activity, and network characteristics (the latter two are also referred to as social interaction features) for representing users' online presence, and employs machine learning and deep learning-based classifiers for identifying accounts potentially linked to the same natural person. Our main contributions can be summarized as follows: to the best of our knowledge, this is the first user identity linkage work to employ (i) a wide range of features extracted from social networks constructed based on users' activity, (ii) advanced syntactic features based on dependency trees, (iii) semantic similarities based on word embeddings, and (iv) deep neural networks in such a classification setup. Moreover, comprehensive evaluation experiments are performed on two Twitter datasets related to abusive behaviors and terrorism phenomena, with English and Arabic material, respectively, and the experimental results are promising, achieving up to 99.50% AUC.
The rest of the paper is organized as follows. Section 2 reviews the related work. Section 3 presents the proposed framework, the extracted features, and the techniques for modeling the data, and predicting possible user linkage. Section 4 describes the employed datasets, the process for constructing the ground truth, and the experimental methodology, while Section 5 presents the experimental results. Finally, Section 6 draws some conclusions and outlines future work.

RELATED WORK
Numerous studies have examined user identity linkage across online social networks; see, e.g., [21,22,34]. Malhotra et al. [22] proposed to disambiguate profiles of the same user based on their digital footprint in both Twitter and LinkedIn. Twitter has also been jointly considered in many works as one of the studied platforms in relation to other social networks, e.g., Yelp [11], Flickr [11], Foursquare [34], Instagram [34], and Facebook [21]. For instance, the authors of [34] proposed a method that exploits location information to examine whether two accounts, one on Twitter and one on Instagram, belong to the same mobile user.
Identity linkage within a single social network has also been explored. For instance, an Irish forum was studied [15] to first unmask authors' identities and then detect matching aliases. The so-called 'sockpuppetry' (i.e., blocked users initiating new accounts) has been considerably studied on Wikipedia [37,40]. Finally, user identity linkage has been explored on popular online news sites, such as The Guardian and SPIEGEL ONLINE, to help their providers detect manipulations of public opinion [32].
Profile, content, and network attributes are often exploited to build such detection models. User name, screen name, and biography are common profile attributes [12,27]. In relation to the posted content, temporal (e.g., timestamps) and spatial (e.g., geotags) information [15,31,34], as well as stylometric features (e.g., part-of-speech n-grams) [15,32,37] are widely employed. The way that a user's social network is formed and their communication patterns can also provide useful information about a user's identity; hence, network attributes have been used to detect actors' identities across multiple social networks [20,31]. For instance, a user's immediate or non-immediate neighborhood can be exploited by considering friendship relations.
Building upon such features, supervised, unsupervised, and semi-supervised methods have been considered. For instance, a probabilistic classification based on Naive Bayes has been employed to link user identities across social media [43]. Decision Trees, SVM, and kNN algorithms have also been tested [22]. Moreover, an alignment algorithm has been used, where an affinity score based on timestamped sparse and dense location-based properties is computed to find the most likely matching identities using a maximum weighted matching scheme [34]. Regarding semi-supervised models, a multi-objective framework has been built for modeling heterogeneous behaviors and structural consistency maximization [21]. Table 1 compares our method to those that are most relevant to our problem setting (i.e., identity linkage within the same platform). Most such works use "classic" (traditional) machine learning classifiers, such as SVMs [15,37,40], Naive Bayes [15], and Random Forest [19,40]. Moreover, matching approaches based on similarity measures (e.g., cosine similarity or Euclidean distance) [14], as well as threshold-based approaches [32], have also been employed. Under the features category, three main types of features are listed, i.e., activity-, linguistic-, and network-based. Depending on the considered platform, different activity-based features are used, such as number of posts and replies, down- and up-votes, number of total revisions, etc. Moreover, users' activity is often examined in relation to the temporal dimension, by considering for instance the mean time between two consecutive posts or the posting activity in relation to different timeframes (such as hours, period of day, and month).
The linguistic-based features are highly related to a user's behavioral and writing style, such as average word length, average number of characters per word and/or sentence, upper-cased letters, and part-of-speech tags (such as verbs, nouns, and adverbs). Finally, the network-based features so far have been related to a reply-based network [19], examining users' tendency to cluster with others (based on clustering coefficient) and quantifying the extent to which users reciprocate the reply communication they receive from other users (reciprocity). Overall, apart from English, Irish [14,15] and German [32] textual sources have been studied.
Contributions. Compared to existing works, we use a wide range of linguistic features (driven by well-established approaches used in similar tasks, e.g., author profiling and identification), while to our knowledge we are the first to employ dependency and tree features in addition to part-of-speech (as syntactic features) in this context. Moreover, we advance the state of the art by considering various social interaction features, which contribute significantly to successfully detecting accounts likely to belong to the same person within a social network. Specifically, we employ a "conversation-based network", which considers mentions, replies, and retweets, to first construct the network and then estimate various network features. To the best of our knowledge, we are the first to employ the conversation-based network and all these features in this context. To be in alignment with the literature, we evaluate various traditional machine learning methods, i.e., probabilistic, tree-based, and ensemble classifiers. In addition, we study the application of deep learning to the user identity linkage task. The designed neural network architecture digests both textual information and various numerical metadata (i.e., activity, linguistic, and network features). Finally, since the propagation of objectionable material is not limited to English, we conduct comprehensive experiments in two case studies related to abusive and terrorism phenomena, associated with English and Arabic textual sources, respectively.

DISCOVERY OF ACCOUNT LINKAGE
This section details the proposed framework for detecting the possible linkage of user accounts in social media based on models of user behavior. To this end, a wide range of user characteristics are considered for representing users' online presence, and, based on these extracted features, machine learning and deep learning-based classifiers are employed for distinguishing between linked accounts (i.e., accounts belonging to the same person) and non-linked accounts.

Individual User Account Features
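Various attributes can be exploited in social media to model the behavior of each individual user, namely: (1) Profile Features (P), extracted from a user's profile, such as demographic information, biography, avatar (i.e., image provided by the user to visually present themselves), etc.; (2) Activity Features (A), which summarize a user's online presence, such as the number of posts, mentions, and hashtags; (3) Linguistic Features (L), which capture the writing style of the posted content; and (4) Network Features (N), which describe a user's popularity and connectivity. Below, we detail the set of features considered per individual user account for each of the aforementioned categories.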
Profile Features. Features in this category include the age of the account (i.e., number of days since its creation), whether the account is verified or not (i.e., acknowledged by Twitter as an account linked to a user of "public interest"), and whether or not the user has provided information about their location.
Activity Features. These features provide an overview of a user's online presence with respect to the considered social network and include the number of: posts, lists subscribed to, shares, favorited tweets, mentions, and hashtags, as well as the posts' inter-arrival time. For instance, mentions can be used to directly interact with another user (and possibly perform direct attacks in an abusive context), while the use of hashtags (particularly of popular ones) is a way to increase a post's visibility.
Linguistic Features. This set of features analyzes the writing style of the author of a tweet. Based on the posted content, surface-oriented and deeper stylistic features are extracted. In particular, five subcategories of features are considered [35], as described next.
1. Character-based features: ratio of the number of each of the following characters to the total number of characters: upper-cased letters, periods, commas, parentheses, exclamation marks, colons, digits, semicolons, hyphens, and quotation marks.
2. Word-based features: mean number of characters per word, vocabulary richness (i.e., number of different words used), acronyms, stopwords, first person pronouns, usage of words composed of two or three characters, standard deviation (STD) of word length, and the difference between the longest and shortest words.
3. Sentence-based features: mean number and standard deviation of words per sentence, and difference between the maximum and minimum number of words per sentence in a text.
4. Dictionary-based features: the ratio of each of the following types of tokens to the total number of words in a text: discourse markers, interjections, abbreviations, curse words, and polar (positive/negative) words [13].
5. Syntactic features: three types of syntactic features are taken into account: (i) Part-of-Speech (POS) features: relative frequency of each POS tag in a text; (ii) Dependency features: occurrence of syntactic dependency relations in the dependency trees of the text; to this end, we extract the frequency of each individual dependency relation per sentence, the usage ratio of the passive voice, and the number of coordinate/subordinate clauses per sentence; and (iii) Tree features: measures of the tree width, the tree depth, and the ramification factor, where tree depth is defined as the maximum number of nodes between the root and a leaf node, tree width is the maximum number of siblings at any level of the tree, and the ramification factor is the mean number of children per level; in other words, the tree features characterize the complexity of the inner structure of the sentences (simple clauses, as well as subordinate and coordinate clauses). To extract syntactic features, the parser presented in [24] has been trained on English and Arabic material annotated with Universal Dependencies.
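The tree features in item 5 can be illustrated with a minimal sketch. It assumes a dependency tree given as a list of head indices (a hypothetical input format, with -1 marking the root), counts depth in edges from the root, takes width as the largest number of nodes on one level, and takes the ramification factor as the mean number of nodes per non-root level. These are one possible reading of the definitions above, not necessarily the exact computation used in this work.

```python
from collections import defaultdict


def tree_features(heads):
    """Return (depth, width, ramification) for one dependency tree.

    heads[i] is the index of the head of token i, or -1 for the root
    (an assumed encoding, chosen only for this illustration).
    """
    children = defaultdict(list)
    root = None
    for token, head in enumerate(heads):
        if head == -1:
            root = token
        else:
            children[head].append(token)
    if root is None:
        raise ValueError("the tree has no root")

    # Walk the tree level by level and record how many nodes each level has.
    level_sizes = []
    frontier = [root]
    while frontier:
        level_sizes.append(len(frontier))
        frontier = [c for node in frontier for c in children[node]]

    depth = len(level_sizes) - 1          # edges on the longest root-to-leaf path
    width = max(level_sizes)              # largest level
    below_root = level_sizes[1:]          # nodes that are children of the level above
    ramification = sum(below_root) / len(below_root) if below_root else 0.0
    return depth, width, ramification


# "She quickly read the old book": "read" is the root, "book" heads "the" and "old".
print(tree_features([2, 2, -1, 5, 5, 2]))
```

Averaging these values over the sentences of a user's posts would give one feature per measure; that aggregation step is likewise an assumption.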
Network Features. This feature category aims to measure the popularity of a user based on different criteria, such as the number of followers (in-degree centrality), friends (out-degree centrality), and their ratio; since Twitter allows users to follow anyone without their approval, this ratio can quantify a user's popularity. Overall, these measures can quantify a user's opportunity to have a positive or negative impact in their ego-network in a direct way.
To dig deeper into users' relations, we construct a "conversation-based network" based on the mentions, replies, and retweets between each pair of users, and extract (using Gephi [9]) six network features grouped as follows: (i) Distribution metrics: hub, authority, eigenvector, and PageRank centralities, which measure users' influence and connectivity in their immediate and extended neighborhoods, (ii) Connection metric: number of triangles a node belongs to, and (iii) Segmentation metric: clustering coefficient, which shows a user's tendency to cluster with others. To the best of our knowledge, we are the first to employ the conversation-based network and all these features in this context.
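As a rough illustration of the connection and segmentation metrics, the sketch below builds a conversation-based network from a list of (source, target) interactions and computes, for one user, the number of triangles and the local clustering coefficient. It assumes the network is treated as undirected and unweighted; the function names and the input format are hypothetical, and the values reported in this work come from Gephi rather than from this code.

```python
from itertools import combinations


def build_undirected(interactions):
    """Build an adjacency map from (source, target) mention/reply/retweet pairs."""
    adjacency = {}
    for source, target in interactions:
        if source == target:
            continue  # ignore self-interactions
        adjacency.setdefault(source, set()).add(target)
        adjacency.setdefault(target, set()).add(source)
    return adjacency


def triangles_and_clustering(adjacency, node):
    """Return (number of triangles through node, local clustering coefficient)."""
    neighbours = adjacency.get(node, set())
    degree = len(neighbours)
    # A triangle exists for every pair of neighbours that are themselves connected.
    triangles = sum(1 for a, b in combinations(neighbours, 2) if b in adjacency[a])
    if degree < 2:
        return triangles, 0.0
    return triangles, 2.0 * triangles / (degree * (degree - 1))


edges = [("u1", "u2"), ("u2", "u3"), ("u3", "u1"), ("u3", "u4")]
network = build_undirected(edges)
print(triangles_and_clustering(network, "u3"))
```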

User Modeling
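The aforementioned feature categories (or sets) $\mathcal{F} = \{\mathcal{P}, \mathcal{A}, \mathcal{L}, \mathcal{N}\}$ (profile, activity, linguistic, and network features, respectively) can be exploited to model the behavior of each individual user account in a social media platform. We thus define the feature vector for each user $u_i$ and feature category $c \in \mathcal{F}$ as $V_c(u_i) = \langle f_{c,1}(u_i), f_{c,2}(u_i), \ldots, f_{c,n_c}(u_i) \rangle$, where $f_{c,k}(u_i)$ is the $k$th feature of category $c$ for user $u_i$, and $n_c$ equals the total number of included features for this category. For instance, for the network features category, a feature vector can be created for every $u_i$ as follows: $V_{\mathcal{N}}(u_i) = \langle f_{\mathcal{N},1}(u_i), \ldots, f_{\mathcal{N},n_{\mathcal{N}}}(u_i) \rangle$. A feature vector can also be created by considering all features from all four sets.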
To detect whether two accounts are likely to belong to the same person, we also need to jointly represent each user pair so as to determine their potential relationship and use that as input to the classifier. To this end, we jointly represent the behavior of each pair of users $u_i$ and $u_j$, $\forall i, j$ where $i \neq j$, as either (i) a feature vector of the absolute differences between the individual feature vectors of $u_i$ and $u_j$, or (ii) a vector of four similarity scores, each estimated based on the similarity of the per-category $\{\mathcal{P}, \mathcal{A}, \mathcal{L}, \mathcal{N}\}$ feature vectors. To estimate these similarities, the cosine similarity, the Euclidean distance, and the Manhattan distance are used; for the latter two, normalization is applied such that all values lie in $[0, 1]$.
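The two pair representations can be sketched as follows. The per-user input is assumed to be a dictionary mapping a category name to its feature vector (a hypothetical layout), and the mapping d / (1 + d) used to bring distances into [0, 1] is only one possible normalization; the text does not specify which one was applied.

```python
import math


def abs_difference(u, v):
    """Representation (i): element-wise absolute difference of two feature vectors."""
    return [abs(a - b) for a, b in zip(u, v)]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for all-zero vectors (assumption)
    return dot / (norm_u * norm_v)


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def squash(distance):
    """Map a non-negative distance into [0, 1); an assumed normalization."""
    return distance / (1.0 + distance)


def similarity_vector(user_i, user_j, measure):
    """Representation (ii): one score per feature category, in sorted category order."""
    return [measure(user_i[c], user_j[c]) for c in sorted(user_i)]


user_a = {"activity": [3.0, 1.0], "linguistic": [0.2, 0.4], "network": [5.0, 0.1]}
user_b = {"activity": [2.0, 1.0], "linguistic": [0.1, 0.4], "network": [1.0, 0.3]}
print(similarity_vector(user_a, user_b, cosine_similarity))
print(similarity_vector(user_a, user_b, lambda u, v: squash(euclidean(u, v))))
```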
Apart from the above approaches to user pair modeling that take into account the extracted features, we can also measure the direct similarity of the evidence associated with each user, such as their posted content, social network, and profile. In particular, we focus on the similarity between the posts of two users, since users tend to express themselves in standard ways by frequently using the same words or expressions; moreover, due to daily social interactions, even different persons may end up using the same words in essentially the same way [1]. We thus consider two additional features corresponding to the similarities between the posts of two users, measured in terms of their (i) edit distance, i.e., the number of changes needed to convert one text into another, and (ii) semantic similarity. To this end, a preprocessing step is applied to remove all numbers, mentions, and URLs from the posts.
Edit distance is estimated with the Levenshtein distance [30], which counts the minimum number of single-character edits needed to convert one string into another; for each pair of users, this is averaged over all pairs of their posts. Semantic similarity is estimated based on a vector space model approach, whereby each word in a post is represented as a word embedding vector. Word embeddings allow modeling both semantic and syntactic relations of words, thus capturing more refined attributes and contextual cues inherent in language. Specifically, we use Word2Vec [25] to: (1) first establish a vocabulary based on the words that occur in the input set more times than a user-defined threshold, (2) apply a learning model so as to learn the words' vector representations in a $d$-dimensional space (50-300 dimensions can model hundreds of millions of words with high accuracy [25]), and (3) output a vector representation for each word encountered in the input texts. Based on [25]
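The edit-distance feature described above can be sketched in a few lines. The Levenshtein routine is the standard dynamic-programming formulation; the helper `mean_edit_distance` and its behavior on empty input are illustrative assumptions, and the posts are assumed to be already preprocessed.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def mean_edit_distance(posts_a, posts_b):
    """Average Levenshtein distance over all cross pairs of posts of two users."""
    distances = [levenshtein(p, q) for p in posts_a for q in posts_b]
    if not distances:
        return 0.0  # convention when a user has no posts (assumption)
    return sum(distances) / len(distances)


print(levenshtein("kitten", "sitting"))
print(mean_edit_distance(["good morning"], ["good evening", "good morning"]))
```

Note that the number of cross pairs grows with the product of the two users' post counts, so this feature is the most expensive of the pair features to compute on large accounts.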

Classification
To be in alignment with the state of the art, we proceed with both traditional machine learning methods and deep neural networks (NNs). Regarding the former, probabilistic (e.g., Naive Bayes, BayesNet), tree-based (e.g., J48, LADTree, LMT), and ensemble classifiers are considered. As an ensemble classifier, we use Random Forest, which constructs a forest of decision trees, each built on random subsets of features; an important advantage is its ability to reduce overfitting by averaging several trees during model construction. Moreover, Random Forests are quite efficient in terms of the time needed to train a model. To build the Random Forest classifier, we set the number of generated trees to 100, while no limit is set on the maximum depth. Even though the traditional machine learning approaches have been extensively used in similar tasks, they face an important drawback: they cannot successfully capture semantic and cultural nuances of the written language. For instance, taking into account the negation of words or sarcastic expressions with traditional machine learning approaches is quite challenging, as the structure of the sentence has to be effectively represented in the set of features. To overcome such difficulties, deep learning algorithms that build upon neural networks have been proposed. Therefore, here we also proceed with a modeling process building upon neural networks. Specifically, in the neural network setup, we build a model to combine raw text with metadata (i.e., profile, activity, linguistic, network, and user pair features), similar to [8]. The combination of raw text with additional behavioral facts (such as users' popularity, social network, and account settings) allows us to capture different facets of users' behavior, and thus possibly to detect more effectively accounts likely to belong to the same user.
Specifically, we construct a single network architecture which combines both text classification and metadata networks (see below) before their inputs are translated into classification probabilities. Figure 1 depicts the deep neural network setup used in this work.
Text Classification Network. We employ a Recurrent Neural Network (RNN) [26], which processes sequential data using recurrent connections between its neural activations at consecutive time steps. RNNs were selected over other NN models since they have proven successful in understanding word sequences and interpreting their meaning. Specifically, we build upon a Gated Recurrent Unit (GRU) since it performs well on short texts (such as tweets) [8]. We employ a GRU with 100 units (neurons); we experimented with different sizes and this gave the best results for both datasets. To avoid over-fitting, we use a recurrent dropout rate of 0.5. Before moving through the RNN layers, the first layer performs a word embedding lookup, where all words are represented as high-dimensional vectors. For English, we use pre-trained word vectors from Twitter [33]; for Arabic, we use AraVec [36], a pre-trained distributed word representation. Tweets' words are mapped onto 200- and 300-dimensional vectors, for English and Arabic, respectively.
Metadata Network. After feeding the data to the metadata neural network, a batch normalization layer is used to enable faster learning and higher overall accuracy. To learn the metadata, we use a simple dense layer with 100 units, i.e., the same dimensionality as the text classification network. Finally, we use tanh as activation function, since it performs well with standardized numerical data.
Combined Network. We combine the text classification and metadata networks using a concatenation layer, followed by a fully connected output layer (i.e., dense layer) with one neuron per class we want to predict and softmax as activation function.

EXPERIMENTS
This section presents our evaluation experiments on abusive and terrorism-related datasets collected from Twitter.

Datasets
The first step is to collect the necessary content from Twitter, i.e., one of the most popular social networks with ∼330 million monthly active users [38], which also gives access to a substantial number of sample tweets via its open API. For our study, two datasets obtained from Twitter are used; we focus on these datasets since they are likely to involve users with multiple accounts [7,18,42]. It should be noted that the collected data correspond to publicly available data; we did not attempt to de-anonymize users, and we fully comply with the terms of use of the APIs we use.
Abusive Dataset. The dataset provided by [5] was used for studying abusive activities on Twitter. The authors collected a set of tweets between June and August 2016, using snowball sampling around the GamerGate controversy [23], which is known to have produced many instances of cyber-bullying and cyber-aggression. GamerGate originated from alleged improprieties in video game journalism, which quickly grew into a larger campaign centered around sexism and social justice. The GamerGate controversy, and more specifically the hashtag #GamerGate, can serve as a relatively unambiguous reference to posts that are likely to involve abusive/aggressive behavior from a fairly mature and hateful online community, since individuals on both sides of the controversy were using this hashtag. Moreover, extreme cases of bullying and aggressive behavior (e.g., direct threats of rape and murder) have been associated with it. Overall, the dataset consists of 600 tweets in English and 312 users.
Terrorism Dataset. This dataset was created using Twitter's Search API, which returns tweets matching specified keywords. Specifically, we collected data from February 2017 to June 2018 using a set of terrorism-related Arabic keywords provided by Law Enforcement and domain experts. The dataset consists of 65 tweets and 35 users. Based on a language detection library [29], 99% of the posts in our dataset are in Arabic.

Ground Truth
Due to the absence of ground truth that indicates which user accounts belong to the same person, the ground truth for each dataset is created as follows. First, we filter out all users with fewer than 10 posts (thus removing all users associated with insufficient evidence), and then we randomly select a subset of user accounts (e.g., $n$ = 200 users) by applying stratified random sampling. To this end, the entire population is first divided into homogeneous groups based on the number of posted tweets; this number is varied between 10 and 60 with step 5, while the final group contains all users with more than 60 posts. Then a random sample is selected from each group, with the sample size being proportional to the group's size compared to the entire population. As in [15,32], where no annotated datasets were available, we build the ground truth by splitting the posts of each selected user into two subsets, assigning to each subset a different user id (e.g., user $u_i$ becomes $a_i$ and $b_i$, and the tweets of $u_i$ are split between $a_i$ and $b_i$). Thus, we end up with a dataset with double the number of user accounts (e.g., 400 users for $n$ = 200) and a set of known linked accounts (i.e., accounts belonging to the same person). Two approaches are considered for splitting the tweets of the original accounts (e.g., $u_i$) into linked users (e.g., $a_i$ and $b_i$): (i) random assignment of an equal number of posts to each, and (ii) interleaving, where posts are initially sorted based on their timestamps and then alternately assigned to each of the linked accounts.
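The two splitting strategies can be sketched as below. Posts are assumed to be dictionaries with a `timestamp` field (a hypothetical layout), and dropping the leftover post when a user has an odd number of posts is an assumption made here to keep the two halves equal; the text does not say how that case was handled.

```python
import random


def split_random(posts, seed=0):
    """Strategy (i): random assignment of an equal number of posts to each account."""
    rng = random.Random(seed)
    shuffled = list(posts)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    # With an odd number of posts, the last shuffled post is dropped (assumption).
    return shuffled[:half], shuffled[half:2 * half]


def split_interleaved(posts):
    """Strategy (ii): sort by timestamp, then alternate posts between the accounts."""
    ordered = sorted(posts, key=lambda post: post["timestamp"])
    return ordered[0::2], ordered[1::2]


posts = [{"timestamp": t, "text": "post %d" % t} for t in (5, 1, 4, 2, 3)]
first, second = split_interleaved(posts)
print([p["timestamp"] for p in first], [p["timestamp"] for p in second])
```

Interleaving keeps the temporal profile of the two synthetic accounts almost identical, whereas the random split can separate posting bursts, which matters for features such as the inter-arrival time.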
Hence, we have two sets of users available: $A = \{a_1, a_2, \ldots, a_n\}$ and $B = \{b_1, b_2, \ldots, b_n\}$. Comparing each user $a_i$ from set $A$ with each user $b_j$ in set $B$, $\forall i, j$ where $i \neq j$, we obtain overall $M = n \times (n - 1)$ user pairs (e.g., for $n$ = 200, $M$ = 39,800), with each such user pair corresponding to a non-linked account. For each dataset, we opt for maintaining a proportion of 10% of linked and 90% of non-linked accounts, given that previous works, e.g., [16], have indicated that about 10% of users within a dataset tend to exhibit bad behavior. Therefore, for a given $n$, we randomly sample from the $M$ non-linked pairs so as to reflect the above observation; e.g., for $n$ = 200, the final dataset contains 200 linked accounts $(a_i, b_i)$ and $m = 9 \times 200$ = 1,800 non-linked accounts $(a_i, b_j)$, $i \neq j$. We also (i) vary the number of randomly selected users $n$ from 200 to 500 in steps of 100, and (ii) create unbalanced datasets by increasing the non-linked accounts; for this, we keep the same number of linked accounts and incrementally increase the number of non-linked accounts with step $9 \times n$. E.g., for $n$ = 200, $m$ ranges from 1,800 to 39,800 with step $9 \times 200$ = 1,800. In the last step, we consider all 39,800 (rather than the 39,600) non-linked accounts.
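The pair counts above can be checked with a short sketch. The account identifiers (`a1`, `b1`, ...) and the function name are hypothetical; the sampling simply draws the requested number of non-linked pairs uniformly at random without replacement.

```python
import random


def build_pairs(n, ratio=9, seed=0):
    """Return (linked pairs, sampled non-linked pairs, total non-linked pairs)."""
    linked = [("a%d" % i, "b%d" % i) for i in range(1, n + 1)]
    all_non_linked = [
        ("a%d" % i, "b%d" % j)
        for i in range(1, n + 1)
        for j in range(1, n + 1)
        if i != j
    ]
    rng = random.Random(seed)
    wanted = min(ratio * n, len(all_non_linked))
    non_linked = rng.sample(all_non_linked, wanted)
    return linked, non_linked, len(all_non_linked)


linked, non_linked, total = build_pairs(200)
print(len(linked), len(non_linked), total)  # 200 linked, 1800 sampled, 39800 in total
```

Raising `ratio` in steps of 9 reproduces the unbalanced settings; the cap at the total number of non-linked pairs corresponds to the last step, where all 39,800 pairs are used.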

Features Selection
Section 3.1 described various features that could be considered for exploring whether two accounts belong to the same person. Given the ground truth creation process applied in this work, profile features, as well as the number of followers, friends, and their ratio among the network features, are excluded (as they would be the same for both linked accounts), while for activity features (for the same reason) we can only consider the number of mentions and hashtags, and the posts' inter-arrival time. Table 2 summarizes the examined features; in real scenarios, all features from the four categories could be considered and may be beneficial for the classification. As expected, some of the features in Table 2 could be more distinguishing and thus assist the classification more. To this end and towards feature selection, we examine the significance of differences between the distributions of linked and non-linked user accounts based on the two-sample Kolmogorov-Smirnov test. This test is used since it makes it possible to assess whether two samples come from the same distribution based on their empirical cumulative distribution functions (ECDFs). We consider as statistically significant all cases with $p$ < 0.01. Due to space limits, we only present the ECDF plots of some features; to improve readability, some plots are trimmed.
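For intuition, the two-sample Kolmogorov-Smirnov statistic is the largest vertical gap between the two ECDFs. The sketch below computes only this statistic in plain Python; it does not compute the p-value, for which a statistics library (for example `scipy.stats.ks_2samp`) would normally be used. The sample values are made up for illustration.

```python
from bisect import bisect_right


def ks_statistic(sample_x, sample_y):
    """Largest absolute difference between the ECDFs of two samples."""
    xs, ys = sorted(sample_x), sorted(sample_y)
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # The supremum is attained at one of the observed values.
    for value in xs + ys:
        ecdf_x = bisect_right(xs, value) / len(xs)
        ecdf_y = bisect_right(ys, value) / len(ys)
        largest_gap = max(largest_gap, abs(ecdf_x - ecdf_y))
    return largest_gap


# Hypothetical per-pair differences in the number of mentions.
linked_diffs = [0.1, 0.2, 0.2, 0.3]
non_linked_diffs = [0.3, 0.5, 0.8, 1.2]
print(ks_statistic(linked_diffs, non_linked_diffs))
```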
Activity Features. Figures 2a-2b plot the ECDF for the number of mentions and hashtags for the linked and non-linked users ($p$ < 0.01). We observe that the non-linked users tend to have a higher difference in relation to the number of mentions and hashtags compared to the linked user accounts. As for the inter-arrival time between the posted tweets (not shown in the plots), the difference is also statistically significant ($D$ = 0.15849).
Linguistic Features. To identify the linkage of two or more accounts we consider a set of various linguistic attributes extracted from the available textual material. Driven by the author profiling and identification tasks, we assume that the writing style of an author is unique enough to be distinguishable from the style of other authors [35]. In the literature for author profiling and identification a wide range of features is utilized; for instance, Burger et al. [3] use more than 15 attributes, while Mukherjee and Liu use more than 1 [28]. For our purposes, a more limited number of linguistic features is exploited, which has been shown to perform well in similar tasks [35]. This set of linguistic features is generic enough to capture the complexity and style of the discourse across different language families. Indicatively, Figures 2c-2f depict the ECDFs for the frequency of verbs, frequency of nouns, mean number of characters per word, and upper-cased characters features. Comparing the distributions among the linked and non-linked accounts, we observe that the differences are statistically significant with $D$ = 0.25181, $D$ = 0.29595, $D$ = 0.30405, and $D$ = 0.29209, respectively. Overall, in an effort to detect the linkage among users with the maximum possible efficiency we consider all the linguistic features presented in Table 2 (the difference in their distributions is statistically significant).
Note. The analysis presented thus far was conducted on the English (abusive) dataset. A similar analysis was conducted for the Arabic (terrorism-related) dataset; we omit the results due to space limits.
Features Evaluation. Table 3 shows the top 12 features for both the abusive and terrorism datasets based on the information gain approach, which ranks features in decreasing order of their information gain (i.e., the reduction in class entropy). We observe that in both cases the network features, which describe the connectivity of users in the network, are among the most contributing ones. Especially for the abusive dataset, such features seem to occupy the first places. Regarding the activity features, the average number of mentions is among the top contributing ones in both cases, while for the terrorism-related dataset in particular both the average number of hashtags and mentions seem to have a better discriminative ability compared to the rest. Focusing on the abusive dataset and the linguistic features, we observe that four out of seven are syntactic-based, which indicates the importance of such features in distinguishing between linked and non-linked accounts. Specifically, the most contributing syntactic-based features are the following: adverbs (part-of-speech), adverbial modifier (adverb or adverbial phrase that serves to modify a predicate or a modifier word), passive nominal subject (a noun phrase which is the syntactic subject of a passive clause), and coordination (the relation between an element of a conjunct and the coordinating conjunction word of the conjunct). With respect to the terrorism dataset and the linguistic features, we observe that the character-, word-, and syntactic-based ones tend to have an important discriminating power, with the average number of punctuation marks and the difference between the longest and shortest words being among the most contributing ones.
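Information gain measures how much knowing a feature reduces the entropy of the class label. The sketch below assumes that each feature has already been discretized into a small number of bins (the binning itself, and the toy data, are assumptions for illustration) and ranks features by decreasing gain.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Class entropy minus the expected class entropy after splitting on the feature."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    total = len(labels)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(table, labels):
    """Return feature names sorted by decreasing information gain."""
    gains = {name: information_gain(values, labels) for name, values in table.items()}
    return sorted(gains, key=gains.get, reverse=True)


# Toy data: 1 = linked pair, 0 = non-linked pair; feature values are bin labels.
labels = [1, 1, 0, 0]
table = {
    "pagerank_diff": ["low", "low", "high", "high"],   # separates the classes perfectly
    "mentions_diff": ["low", "high", "low", "high"],   # carries no information
}
print(rank_features(table, labels))
```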
Overall, for the English (abusive) dataset, most of the features presented in Table 2 are useful (statistically significant) in discriminating between the two classes (i.e., linked and non-linked user accounts). However, some are not useful and are excluded to avoid adding noise. Specifically, two features are excluded: the number of triangles and the clustering coefficient. For the Arabic dataset, all features are useful and thus are used during the modeling analysis.

Experimental Methodology
The features from the three categories { , , } that are selected as described above are employed for user modeling, while user pairs are modeled based both on the absolute difference (abs) and on the similarity of feature vectors (sim); similarity is estimated based on Cosine similarity, and Euclidean and Manhattan distances. Therefore, the following approaches are evaluated: , , , , and . Moreover, the concatenation of and is also considered. In addition, the two features derived by modeling each pair of users using the edit distance and semantic similarities (see Section 3.2) are considered in conjunction with the above, resulting in five additional approaches (see Table 4). Overall, a total of 11 different methods are evaluated.
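The two user-pair modeling schemes can be sketched as follows, assuming each user is already represented by a numeric feature vector. The function names and the toy vectors are hypothetical; only the measures themselves (absolute difference, Cosine similarity, Euclidean and Manhattan distances) come from the text.

```python
from math import sqrt


def abs_difference(u, v):
    """Element-wise absolute difference of two feature vectors (abs scheme)."""
    return [abs(a - b) for a, b in zip(u, v)]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan_distance(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def pair_features(u, v):
    """Return the abs representation, the sim representation, and their concatenation."""
    abs_part = abs_difference(u, v)
    sim_part = [
        cosine_similarity(u, v),
        euclidean_distance(u, v),
        manhattan_distance(u, v),
    ]
    return abs_part, sim_part, abs_part + sim_part


# Hypothetical feature vectors of two accounts.
user_a = [3.0, 4.0, 0.0]
user_b = [0.0, 4.0, 3.0]
abs_vec, sim_vec, both = pair_features(user_a, user_b)
```

Note that the abs scheme keeps one value per original feature, whereas the sim scheme compresses the whole vector pair into three numbers, which is one reason the two can behave differently across classifiers.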
We examined various machine learning algorithms, either probabilistic, tree-based, or ensemble classifiers, as well as deep neural networks. For each family of classifiers, we only present those that achieve the best results (due to space limits). Specifically, BayesNet, J48, and Random Forest (RF) are used as probabilistic, tree-based, and ensemble classifiers, respectively, along with the neural network setup. We use WEKA for the traditional classifiers, and Keras with Theano [39] for the deep learning models. In all cases, we use repeated (5 times) 10-fold cross validation which is less variable than the ordinary 10-fold cross validation [17].
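The repeated cross-validation protocol can be sketched as an index generator: each of the 5 repetitions reshuffles the data and produces a fresh 10-fold partition, so every classifier is evaluated on 50 train/test splits. Stratifying the folds by class is an assumption made here, and the function name, the seed, and the scaled-down label counts are hypothetical.

```python
import random


def repeated_stratified_kfold(labels, n_splits=10, n_repeats=5, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated stratified k-fold."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for repeat in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)  # a new random partition per repetition
            for pos, idx in enumerate(shuffled):
                folds[pos % n_splits].append(idx)  # deal round-robin per class
        for fold, test_idx in enumerate(folds):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(labels)) if i not in held_out]
            yield repeat, fold, train_idx, sorted(test_idx)


# Toy labels with a 1:9 class ratio (1 = linked pair, 0 = non-linked pair).
labels = [1] * 20 + [0] * 180
splits = list(repeated_stratified_kfold(labels))
```

Averaging a metric over all 50 splits, rather than over a single 10-fold run, is what reduces the variance of the reported estimate.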
Baseline. Among the 11 approaches, the first three (i.e., , , ) are our baselines. Our aim is not only to determine the most effective classification approach, but also to assess whether the consideration of further information in the classification model (i.e., the features combined under different schemes) improves the overall performance, regardless of the choice of the classification algorithm. As shown in Table 1, a wide range of activity, linguistic, and network features have been exploited in previous related research. To stay aligned and comparable with the literature to the maximum extent possible, we consider a substantial number of these features. Specifically, we focus on those that are most applicable to our problem setting, since the inherent differences in the structure of the various social media platforms mean that different features are applicable in each case.
At the same time, we further expand these features to better describe online user behavior. Specifically, as for the linguistic features, we consider both dependency and tree features in addition to other commonly used ones (e.g., part-of-speech). Moreover, a wider range of network features is extracted by building on top of the conversation-based network constructed using mentions, replies, and retweets; previous work has used only a reply-based network and considered only two network features. Finally, to further improve the detection process, we also experiment with different combinations of features and user modeling approaches (i.e., absolute difference and similarity of feature vectors), while at the same time we further enhance the baseline by employing similarity-based features (i.e., edit distance and semantic similarity), which can encapsulate the authors' writing style in greater depth.
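As a rough illustration of an edit-distance feature between the texts of two accounts, the sketch below computes the Levenshtein distance and a length-normalized similarity derived from it. How the edit distance is actually applied to user pairs is defined in Section 3.2 and is not reproduced here; the choice of Levenshtein distance, the normalization, and the function names are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_similarity(a, b):
    """Map the distance to [0, 1], where 1 means identical strings (hypothetical scaling)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


distance = edit_distance("kitten", "sitting")
similarity = normalized_edit_similarity("kitten", "sitting")
```

Normalizing by the longer string keeps the feature comparable across accounts whose texts differ greatly in length.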
Evaluation metrics. To be in alignment with similar works, standard evaluation metrics are reported: precision (prec), recall (rec), weighted area under the ROC curve (AUC), and accuracy (Acc). In each table and for each evaluation metric (i.e., accuracy, AUC, precision, and recall), we highlight the best-performing value.
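For reference, the reported metrics can be computed as in the following sketch for a single positive class. The class-weighted averaging mentioned above is not reproduced; the AUC is computed through its rank interpretation (the probability that a random positive instance is scored above a random negative one), and the labels, scores, and 0.5 decision threshold are hypothetical.

```python
def precision_recall_accuracy(y_true, y_pred, positive=1):
    """Precision and recall for the positive class, plus overall accuracy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, correct / len(pairs)


def roc_auc(y_true, scores, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 1 = linked pair, 0 = non-linked pair.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
precision, recall, accuracy = precision_recall_accuracy(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Unlike precision, recall, and accuracy, the AUC does not depend on the decision threshold, which is why it can remain high even when the other metrics drop on unbalanced data.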

RESULTS
We first evaluate user identity linkage detection on the abusive dataset and then on the terrorism dataset. The results are first presented on datasets built with 200 linked and 1,800 non-linked accounts, and then for varying numbers of both. Moreover, the presented results are based on randomly assigning tweets between linked accounts when building the ground truth; we achieve similar performance with interleaving (we omit these results due to space limits).

Abusive Dataset (English tweets)
Table 4 shows that BayesNet achieves the best results when using the absolute difference for the user modeling, with AUC between 74.20% and 98.22% and accuracy between 91.26% and 97.64%. We achieve the best precision and recall with the network features, either on their own (97.58% and 97.60%) or combined with the two texts' similarity measures, i.e., edit distance (edits) and semantic similarity (sem) (97.58% and 97.64%). With regard to feature categories, the activity ones contribute the least, with 88.94% precision, 91.28% recall, and a moderate AUC of 74.20%.
Similar to BayesNet, J48 achieves the best AUC (up to 95.30%) based on the absolute difference between features, while again we achieve the best performance using the network features (i.e., 99.08% precision and 99.10% recall). Finally, texts' similarities appear to have an important role, since in most cases they tend to improve the classification results.
Contrary to the traditional classifiers, the linguistic features perform better in the NN setup (i.e., 91.65% AUC, 94% precision, and 95% recall) compared to the activity and network ones. Overall, we obtain the best performance in terms of AUC (96.13%) when all features are considered, using both the absolute difference and the similarity of feature vectors for user modeling. This indicates that the more information is given as input to the NN, the better the performance.
Finally, the Random Forest ensemble classifier achieves the best performance when network features are used in addition to texts' similarities. Specifically, AUC equals 99.50%, with precision, recall, and accuracy around 97.80%. Compared to the probabilistic and tree-based classifiers and the deep neural networks, the Random Forest model achieves the best AUC, with precision and recall values among the top; thus, we use only this model in the following experiments.
Thus far, we used the ground truth created with 200 randomly selected users. Next, we vary this number from 200 to 500 in steps of 100. Figure 3a, which depicts the performance of the Random Forest model, shows that from 200 to 300 users there is a slight increase in precision, recall, and accuracy, after which the performance is quite stable, with more than 99% AUC in all cases.
We also examine how the number of the non-linked instances (unbalanced dataset) affects the results. The selected number of linked accounts equals 200, thus the upper limit of non-linked accounts equals 39,800. Figure 3b indicates that even with the highest number of non-linked user accounts, AUC remains at quite satisfactory levels (87.30%). Precision and recall increase as more data is available, while after a point (∼24 non-linked accounts) they are not significantly affected. This is mainly attributed to the higher precision and recall values for the non-linked accounts. Hence, even with a higher number of non-linked accounts, the proposed model can effectively distinguish between linked and non-linked users.

Terrorism Dataset (Arabic tweets)
Table 5 shows that when using BayesNet, the linguistic features alone result in better performance compared to the activity and network ones. We achieve the best precision (97.26%) and recall (96.78%) when we consider all feature categories together using both the absolute difference and the similarity of feature vectors for user modeling. AUC remains above 94% in all cases, except when only the activity features are considered (81.20% AUC). Contrary to the BayesNet results on the abusive dataset, here we see that when the similarity of feature vectors (combined with additional features) is used as a user modeling method, we achieve high precision and recall values (up to 97.26% and 96.78%, respectively).
Out of the tree-based classifiers, J48 performs best (similar to the abusive case), also following a similar pattern in terms of the best-performing feature categories and user modeling methods. Network features appear to contribute more, with the best performance (i.e., 97.68% precision, 97.70% recall, 94.16% AUC) achieved when they are combined with the texts' similarity measures.
Similar to the abusive case, linguistic features contribute more in the NN setup (96.40% AUC, 96% precision and recall) compared to activity and network ones. We obtain the best AUC (98.45%) when all feature categories are considered, in addition to the texts' similarities. In almost all cases, AUC, precision, and recall are higher than 90%, highlighting the stability of the used setup.
Finally, the best performance for the Random Forest (99.50% AUC) is obtained when all features under the absolute difference modeling method are combined with the texts' similarities. Regarding the feature categories, linguistic features result in better performance compared to the rest (98.72% AUC), which is also the case when combined with the texts' similarities (99% AUC). Overall, Random Forest leads to the best AUC and therefore is used next.
Figure 4a shows the performance of Random Forest when the number of the selected linked accounts changes. AUC is fairly stable, with its value above 99% in all cases, which indicates the suitability of the proposed model. The other evaluation metrics increase as the number of linked accounts grows. Figure 4b depicts how the proposed model performs with an unbalanced dataset (as in the abusive case: 200 linked and up to 39,800 non-linked accounts). Overall, AUC fluctuates from 94% to 99.50%, and precision and recall from 97.1% to ≈99%, which again points to the stability of the proposed model.

Classification Takeaways
Overall, our models perform well for both the abusive and terrorism-related datasets. For instance, the high ROC area for the overall classification (99.50% in both cases) indicates that the proposed models can quite successfully discriminate between linked and non-linked accounts. Even though the performance differs slightly across the two cases in terms of precision, recall, and the classification models, in both studied cases the traditional classifiers performed better. The lower performance of the neural network model can be explained by the limited number of instances used for building the model, since NNs perform better when a large amount of training data is available. Moreover, in most cases, a better performance is achieved when baseline features are enhanced with additional information.
Focusing on the specific feature categories, we observe that the network features contribute significantly to the classification (especially when traditional classifiers are used); this highlights the importance of considering the connectivity of a user in a network to detect more efficiently the linkage between users.
An important observation is that the proposed models perform well in different languages, and the performance, in some cases, is slightly better on the Arabic dataset. This could possibly be attributed to the way the initial data was collected. The abusive dataset was created using #Gamergate as a seed word for querying Twitter, and further filtering keywords were then added in consecutive time intervals during the collection process to select additional abusive-related content [5]. On the contrary, the terrorism-related data was collected based on targeted filtering keywords from the very beginning. Hence, the abusive dataset is less focused than the terrorism-related one, and thus users' behavioral patterns may differ more, making the classification somewhat harder.
Overall, with either more targeted or broader data, the proposed ensemble models succeed in distinguishing quite effectively between linked and non-linked accounts. Moreover, we observe that, for both the abusive and terrorism datasets, the ensemble models built using the network features in addition to the texts' similarity measures result in high performance (AUC > 98% and Acc, Prec, Rec > 97%). Hence, since some linguistic features are language-dependent and additional effort would be needed to construct such models for other languages, one could opt for the network-based model, which is easier to adapt to different languages (probably with a slight negative effect on the overall performance).

CONCLUSIONS & FUTURE WORK
Similar to the offline world, user-generated content in online social networks often relates to abusive or even illegal activities. While social media administrators often take intensive action to remove content, and the respective content producers, that do not comply with their rules, users with non-legitimate or abnormal activity often create multiple accounts in an effort to bypass and stay a step ahead of the applied combating measures. This work proposed a framework for detecting accounts likely to belong to the same natural person in an attempt to combat multiple non-legitimate accounts. We considered several attributes of users' online activity, posts, and networks, and tested traditional machine learning methods as well as deep neural networks. The results showed that our method is able to effectively detect linked accounts related to non-legitimate, or even illegal (abusive and terrorism-related), activities in different languages: English and Arabic.
As future work, we plan to conduct our analysis on other online social media platforms, such as YouTube and Facebook, so as to understand if our methods can be easily adapted within and across other social networks. Moreover, the proposed method could be extended to consider additional linguistic attributes, like sarcasm and irony. Finally, we aim to also investigate the effectiveness of our framework in domains amenable to public opinion manipulation and propaganda, such as politics.