Mean Birds: Detecting Aggression and Bullying on Twitter

Over the past few years, bullying and aggression against Internet users have grown on social media platforms, with serious consequences for victims of all ages. Recent cyberbullying incidents have even led to teenage suicides, prompted by prolonged and/or coordinated digital harassment. Although this issue affects as many as half of young social media users, tools and technologies for understanding and mitigating it are scarce and mostly ineffective. In this paper, we present a principled and scalable approach to detect bullying and aggressive behavior on Twitter. We propose a robust methodology for extracting text, user, and network-based attributes from a corpus of 1.6M tweets posted over 3 months. We study the properties of cyberbullies and aggressors, and what features distinguish them from regular users, alongside crowdsourced labeling provided by human annotators. Finally, we use machine learning classification algorithms to detect users exhibiting bullying and aggressive behavior, achieving over 90% AUCROC.


Introduction
Cyberbullying and cyberaggression have unfortunately become serious and extensive issues that affect a large number of individuals worldwide. Both young people and adults face different forms of abuse and harassment. In the physical world, bullying entails repeated and/or coordinated actions involving negative or aggressive interactions, often in the form of threats, damaging rumors spread about the victim, and verbal or physical attacks. Its digital manifestation, aka cyberbullying, differs from it in various aspects, as digital harassment may be carried out by total strangers, and with the potential "protection" of anonymity.
Over the past few years, as social interactions have migrated online, social media have become an integral part of people's everyday life. As a consequence, bullying and aggression against Internet users have increased dramatically. Even just a few years ago, cyberbullying was not taken seriously, and the typical advice was to "just turn off the screen" or "disconnect" [25]; however, as its proliferation and the extent of its consequences reach epidemic levels, it can no longer be disregarded. In 2014, about 50% of young social media users reported being bullied online in various forms. Moreover, today's hyper-connected society allows a phenomenon once limited to particular places or times of the day (e.g., school hours) to occur anytime, anywhere, through a few taps on a keyboard.
Popular social media platforms like Twitter and Facebook are not immune to cyberbullying and aggressive behavior [32]. In fact, racist and sexist attacks on Twitter may have deterred potential buyers of the company [36]. Regardless, there are very few successful efforts to detect abusive behavior on Twitter, both from the research community (see Related Work) and Twitter itself [35]. Arguably, there are several inherent obstacles. First, tweets are short and full of grammatical and syntactic flaws, which makes it harder to use natural language processing tools to extract text-based attributes and characterize user interactions. Second, each tweet provides fairly limited context; therefore, taken on its own, an aggressive tweet may be disregarded as normal text, whereas, read along with other tweets, either from the same user or in the context of an aggressive behavior from multiple users, the same tweet could be characterized as bullying. Third, despite extensive work on spam detection in social media [12,34,39], Twitter is still full of spammy accounts [5], often using vulgar language and exhibiting behavior (repeated posts with similar content, mentions, or hashtags) that could also be considered aggressive or bullying. Distinguishing such accounts from actually abusive users may be a difficult task. Finally, aggression and bullying against an individual can be performed in several ways beyond abusive language, e.g., via constant sarcasm, trolling, etc.
In this paper, we address the problem of detecting cyberbullying and aggressive behavior on Twitter. We design and execute a novel methodology geared to label aggressive and bullying behavior of Twitter users. We present a principled and scalable approach for eliciting user, text, and network-based attributes of Twitter users from a large corpus of 1.6M tweets collected over 3 months, extracting a total of 30 features. We study the properties of cyberbullies and aggressors, and what features distinguish them from regular users, alongside labels provided by human annotators recruited from a popular crowdsourcing platform, namely, CrowdFlower. Our modeling yields some interesting findings, e.g., that bully users are less "popular" and participate in fewer communities; however, when they do become active, they post more frequently, using fewer hashtags, URLs, etc., than others. Moreover, we show that bully and aggressive users tend to attack, in short bursts, the particular users or groups they target. We also find that, although largely ignored in previous work, network-based attributes are actually the most effective features for detecting aggressive user behavior. Finally, we show that our models can be fed to machine learning classification algorithms to effectively detect users exhibiting bullying and aggressive behavior, achieving up to 90.7% AUCROC, 89.9% precision, and 91.7% recall.
Paper Organization. The rest of the paper is organized as follows. The next section reviews related work, then Section 3 provides a high-level overview of our methodology. In Section 4, we present our dataset and the steps taken for cleaning and preparing it for analysis. Section 5 discusses the features extracted from the dataset and the classes of users we consider, while Section 6 presents the techniques used to model the data and predict user behavior. Finally, the paper concludes in Section 7.

Related Work
We now review previous work on detecting offensive, abusive, aggressive, or bullying content or behavior on social media, as well as the attribute-selection methods used for such detection.
Detection. Over the past couple of years, a few techniques have been proposed that aim to detect offensive or abusive content/behavior in platforms such as Instagram [16], YouTube [6], Yahoo Finance [10], and Yahoo Answers [18]. More specifically, Chen et al. [6] use both textual and structural features (e.g., ratio of imperative sentences, adjectives and adverbs as offensive words) to predict a user's aptitude for producing offensive content in YouTube comments, while Djuric et al. [10] rely on word embeddings to distinguish abusive comments on Yahoo Finance. Nobata et al. [28] perform hate speech detection on Yahoo Finance and News data, using supervised learning classification. Kayes et al. [18] find that users tend to flag abusive content posted on Yahoo Answers in an overwhelmingly correct way (as confirmed by human annotators). Also, some users significantly deviate from community norms, posting a large amount of content that is flagged as abusive. Through careful feature extraction, they also show that it is possible to use machine learning methods to detect users who were eventually suspended.
Dinakar et al. [9] detect cyberbullying by decomposing it into the detection of sensitive topics. They collect YouTube comments from controversial videos, rely on manual annotation to characterize them, and perform a bag-of-words driven text classification. Hee et al. [38] study linguistic characteristics in cyberbullying-related content extracted from Ask.fm, aiming to detect fine-grained types of cyberbullying, such as threats and insults. Besides the victim and the harasser, they also identify the bystander-defenders and the bystander-assistants, who support, respectively, the victim or the harasser. Hosseinmardi et al. [16] study images posted on Instagram and their associated comments to detect and distinguish between cyber-aggression and cyberbullying.
Attribute selection. A number of methods have been used to detect harassment on social media. For instance, text features are often used to extract attributes that are in turn leveraged for classification. These include punctuation, URLs, part-of-speech, n-grams, Bag of Words (BoW), as well as lexical features relying on dictionaries of offensive words, and user-based features such as a user's membership duration, activity, number of friends/followers, etc. Different supervised approaches have then been used for detection: [28] use a regression model, whereas [8,9,38] rely on various machine learning methods such as Naive Bayes, Support Vector Machines (SVM), and Decision Trees (J48). By contrast, [15] use a graph-based approach based on likes and comments to build bipartite graphs and identify negative behavior. A similar graph-based approach is also used by [16].
Sentiment analysis of text can also contribute useful features in detecting offensive or abusive content. For instance, [26], besides basic features such as a BoW, also use sentiment scores of data collected from Kongregate (an online gaming site), Slashdot, and MySpace. They also use a probabilistic sentiment analysis approach to distinguish between bullies and non-bullies, and rank the most influential users based on a predator-victim graph (built from exchanged messages). Finally, [40] also rely on sentiment to identify victims on Twitter who pose high risk either to themselves or to others. Apart from using positive and negative sentiments, they also consider specific emotions, such as anger, embarrassment and sadness.
Overall, our work advances the state of the art in cyberbullying and aggression detection by proposing a scalable methodology for large-scale analysis and extraction of text, user, and network-based features on Twitter. Our novel methodology analyzes users' tweets, individually and in groups, and extracts appropriate features that connect user behavior with a tendency toward aggression or bullying. We examine the importance of such attributes, and advance the state of the art by focusing on new network-related attributes that further distinguish the specific user behaviors.

Methodology
Our approach to detecting aggressive and bullying behavior on Twitter, as summarized in Figure 1, involves the following steps: (1) data collection, (2) preprocessing of tweets, (3) extracting user-, text-, and network-level features, and (4) user modeling and characterization.
Data Collection. Our first step is to collect tweets and, naturally, there are a few possible ways to do so. In this paper, we rely on Twitter's Streaming API, which provides free access to 1% of all tweets. The API returns each tweet in a JSON format containing the content of the tweet, some metadata (e.g., creation time, whether it is a reply or a retweet, etc.), as well as information about the poster (e.g., username, followers, friends, number of total posted tweets).
Preprocessing. Next, we remove stop words, URLs, and punctuation marks, and perform normalization, i.e., we eliminate repeated letters and repetitive characters which users often use to express their feelings more intensely (e.g., the word 'yessss' is converted to 'yes'). This step also involves spam removal, which can be done using a few different techniques [12,39] relying on tweeting behavior (e.g., using many hashtags per tweet) or network features (e.g., spam accounts forming micro-clusters).
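The normalization step can be sketched as follows; the stop-word list and regular expressions are illustrative simplifications, not the exact rules used in our pipeline.

```python
import re

STOPWORDS = {"the", "a", "an", "is", "to", "of"}  # illustrative subset

def normalize(tweet: str) -> str:
    """Lowercase, strip URLs and punctuation, drop stop words, and
    collapse characters repeated 3+ times (e.g., 'yessss' -> 'yes')."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", "", text)      # remove URLs first
    text = re.sub(r"[^\w\s#@]", "", text)         # strip punctuation
    text = re.sub(r"(\w)\1{2,}", r"\1", text)     # collapse repeats
    return " ".join(t for t in text.split() if t not in STOPWORDS)
```

For example, `normalize("yessss!!! check https://t.co/xyz")` yields `"yes check"`.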
Sessionization. As mentioned in the Introduction, analyzing single tweets does not provide enough context to discern when a user is behaving in an aggressive or bullying way. Consequently, we group tweets from the same user, based on time clusters, into sessions, which allows us to analyze contents of sessions rather than single tweets.
Building Ground Truth with Crowdsourcing. Next, we build ground truth (needed for machine learning classification, as explained below) using crowdsourcing. More specifically, we rely on human annotators, who are provided with a set of tweets from a user and asked to classify them according to predefined labels.
Feature Extraction. The next step is to extract features from tweets and the users that post them. These can be user-, text-, and network-based features, such as the number of followers, tweets, hashtags, etc. The selection of appropriate features is obviously a very important step to speed up and improve learning quality [20].
Classification. The final step is to perform classification using the extracted features and the ground truth. Naturally, different machine learning techniques can be used for this task, including probabilistic classifiers (e.g., Naïve Bayes), decision trees (e.g., J48) or ensembles (e.g., Random Forests), as well as neural networks.

Scalability and Online Updates. An important challenge is enabling scalable analysis of large-scale tweet corpora. Obviously, several of the aforementioned steps can be performed in parallel, over N subsets of the data, on N cores (as depicted in Figure 1). Also, depending on whether data is processed in batches or in a streaming fashion, one can use different modeling algorithms and processing platforms, e.g., batch platforms like Hadoop vs. distributed stream processing engines like Storm. Either way, some of the steps can be periodically executed on new data, so that the model can be updated following changes in data and/or manifestations of new aggressive behaviors. We argue that our pipeline design provides several benefits with respect to performance, accuracy, and extensibility. First, we can handle the analysis of large volumes of tweets. Second, it allows regular updates of the model, thus capturing previously unseen human behaviors. Third, we can easily plug in new features, e.g., new metadata becoming available from the Twitter platform, or from new research results. Finally, different components can be updated or extended with new technologies, e.g., allowing for better data cleaning, feature extraction, and modeling.

Dataset & Ground Truth
In this section, we present the data used throughout the rest of the paper, which we collected between June and August 2016, and the way we process it to build ground truth. We start by gathering two sets of tweets: (1) Baseline: a random sample of tweets from the Twitter Streaming API, less likely to contain offensive behavior (detailed below); (2) Hate-related: a set of 650k tweets collected from the Twitter Streaming API by selecting 309 hashtags associated with bullying and hateful speech.
The list of 309 hashtags was compiled as follows. First, we obtain a 1% sample of public tweets from June to August 2016 from the Streaming API, and parse it to select all tweets containing #GamerGate. The GamerGate controversy [23] is one of the most well documented large-scale instances of bullying/aggressive behavior that we are aware of. It stemmed from alleged improprieties in video game journalism which quickly grew into a larger campaign centered around sexism and social justice. With individuals on both sides of the controversy using it, and extreme cases of cyberbullying and aggressive behavior associated with it (e.g., direct threats of rape and murder), #GamerGate serves as a relatively unambiguous hashtag associated with tweets that are likely to involve the type of behavior we are interested in. We use #GamerGate as a seed for a sort of snowball sampling of other hashtags likely associated with cyberbullying and aggressive behavior: we also include tweets containing any of the 308 hashtags that appeared in the same tweet as #GamerGate. Indeed, when manually examining these hashtags, we see that they include a number of hateful words or hashtags, e.g., #IStandWithHateSpeech, #KillAllNiggers and #InternationalOffendAFeministDay.
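One round of this seed-based expansion can be sketched as follows; for simplicity, the sketch extracts hashtags from raw tweet text, whereas our pipeline reads them from the Streaming API's hashtag entities.

```python
import re

def cooccurring_hashtags(tweets, seed="#gamergate"):
    """Return hashtags appearing in the same tweet as the seed hashtag."""
    tags = set()
    for text in tweets:
        # Case-insensitive hashtag extraction from the tweet text.
        found = {t.lower() for t in re.findall(r"#\w+", text)}
        if seed in found:
            tags |= found - {seed}
    return tags
```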
Apart from the hate-related set, we also crawl a random set of tweets, which serves as a baseline as it is less prone to contain any kind of offensive behavior. To examine whether there are actual differences between the two sets, we consider the number of followers, the usage of hashtags in users' posts, and the expressed sentiment. Figures 2a and 2b show that there are substantial differences in the users' social and tweeting activity. We observe that users from the hate-related set have more followers than those in the baseline set. This could be because users with aggressive behavior tend to accumulate more popularity in their network (Figure 2a). Also, baseline users tweet with fewer hashtags than users from the hate-related dataset (Figure 2b). This could be because users from the hate-related dataset used Twitter as a rebroadcasting mechanism aiming to attract attention to the topic. Finally, Figure 2c shows that the hate-related set contains more negative sentiment than the baseline set.

Preprocessing
Next, we perform several steps to prepare the data for crowdsourced labeling using human annotators, and to build ground truth.
Cleaning. The first step is to clean the data of noise, i.e., removing URLs, numbers, stop words, emoticons, and punctuation, as well as converting all characters to lower case.
Removing Spammers. Previous work has shown that Twitter contains a non-negligible amount of spammers [5], i.e., users posting unsolicited content, and has proposed a number of detection tools [12,39]. Therefore, we perform a first-level detection of such users and remove them from our dataset. Following [39], we use two main indicators of spam: (i) using a large number of hashtags per tweet (to boost visibility), and (ii) posting a large number of tweets that are highly similar to each other.
To find optimal cutoffs for these heuristics, we study both the distribution of hashtags and the duplication of tweets. Concerning the hashtag distribution, we observe that the average number of hashtags within a user's posts ranges from 0 to about 17. We experimented with different possible cutoffs and, after a manual inspection of a sample of posts, set the limit to 5 hashtags. Thus, users with more than 5 hashtags per tweet on average are removed.
Next, we estimate the similarity of a user's tweets via the Levenshtein distance [27], i.e., the minimum number of single-character edits needed to convert one string into another, averaged over all pairs of their tweets. For each user, we first compute the pairwise intra-tweet similarities: for a user with x tweets, this yields a set of n similarity scores, where n = x(x − 1)/2. Then, we compute the average intra-tweet similarity per user; if it is above 0.8, we exclude the user and their posting activity. Figure 2d shows that about 7% of the users exhibit such highly similar posts and were thus excluded from our dataset.
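The similarity computation can be sketched as follows; normalizing the edit distance by the longer string's length is our assumption, as the paper does not spell out the exact normalization.

```python
from itertools import combinations

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits converting a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def avg_intra_similarity(tweets):
    """Average similarity over all n = x(x-1)/2 pairs of a user's
    tweets, with similarity = 1 - distance / max_length."""
    scores = [1 - levenshtein(a, b) / max(len(a), len(b), 1)
              for a, b in combinations(tweets, 2)]
    return sum(scores) / len(scores)
```

A user whose average similarity exceeds 0.8 is then removed as a likely spammer.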

Sessionization
Since cyberbullying behavior usually involves repetitive actions, we aim to study users' tweets over time. To this end, for each user, we create sets of time-sorted tweets (sessions) by grouping closely posted tweets. Figure 2e overviews our sessionization process, which we discuss in detail below.
First, we remove users who are not significantly active, i.e., those posting fewer than 5 tweets in the examined 3-month period. Then, we use a session-based model where, for each session S_i, the inter-arrival time between tweets does not exceed a predefined time threshold t_l. We experimented with various values of t_l to find an optimal session duration (i.e., sessions that are neither extremely long nor nearly empty) and settled on 8 hours as the threshold. The minimum, median, and maximum length of the resulting sessions (in number of included tweets) for the #GamerGate dataset are 12, 22, and 2.6k; for the baseline set of tweets they are 5, 44, and 1.6k.
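The session construction can be sketched as follows, assuming tweet timestamps in POSIX seconds:

```python
def sessionize(timestamps, gap_hours=8):
    """Split a user's time-sorted tweet timestamps into sessions: a new
    session starts whenever the gap between consecutive tweets exceeds
    the inter-arrival threshold (8 hours in our setting)."""
    gap = gap_hours * 3600
    sessions, current = [], []
    for ts in sorted(timestamps):
        if current and ts - current[-1] > gap:
            sessions.append(current)
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions
```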
Next, we divide sessions into batches, since otherwise they contain too much information to be carefully examined by a crowdworker within a reasonable time. To find the optimal size of a batch, we performed preliminary labeling runs on CrowdFlower (see the next section for details), with 100 workers each, using batches of size exactly 5, 5-10, and 5-20 tweets. Our intuition is that a larger batch size provides more context for the worker to assess whether a poster is acting in an aggressive or bullying manner. We found that the best results with respect to labeling agreement occur with batches of 5-10 tweets; therefore, we eliminate sessions with fewer than 5 tweets and further split those with more than 10 (preserving the chronological ordering of their posting times). In the end, we obtain 1,500 batches. We maintain the same number of batches for both the hate-related and baseline tweets.
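The batching step can be sketched as follows; splitting long sessions into near-equal chronological parts is our choice, since the text only requires that batches end up in the 5-10 tweet range:

```python
from math import ceil

def to_batches(session, min_size=5, max_size=10):
    """Drop sessions with fewer than min_size tweets; split longer ones
    into near-equal chronological batches of at most max_size tweets."""
    n = len(session)
    if n < min_size:
        return []
    k = ceil(n / max_size)        # number of batches
    base, rem = divmod(n, k)      # sizes: base+1 for the first rem batches
    batches, start = [], 0
    for i in range(k):
        size = base + (1 if i < rem else 0)
        batches.append(session[start:start + size])
        start += size
    return batches
```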

Crowdsourced Labeling
We now present the design of our crowdsourced labeling study, performed on crowdflower.com.
Labeling. We aim to label each Twitter user as normal, aggressive, bullying, or spammer by analyzing their batch(es) of tweets. Note that we allow for the possibility that a user is spamming and has passed our basic spam filtering. We provide simple definitions of aggressive and bullying behavior, building on previous research [13,33,37], which defines cyberbullying as "a repeated and hostile behavior by a group or an individual, using electronic forms of contact" and cyberaggression as "intentional harm delivered by the use of electronic means to a person or a group of people irrespective of their age, who perceive(s) such acts as offensive, derogatory, harmful, or unwanted". In our case, the workers were provided with the following definitions of cyberaggressive, cyberbullying, and spam behavior:
1. cyberaggressive user: someone who posts at least one tweet or retweet with negative meaning, with the intent to harm or insult other users (e.g., the original poster of a tweet, a group of users, etc.);
2. cyberbullying user: someone who posts multiple (at least two) tweets or retweets with negative meaning on the same topic and in a repeated fashion, with the intent to harm or insult other users (e.g., the original poster of a tweet, a minor, a group of users, etc.) who may not be able to easily defend themselves during the postings;
3. spammer user: someone who posts texts of an advertising/marketing or other suspicious nature, such as selling products of an adult nature, or phishing attempts.
Control Cases. To assess the reliability of each worker, we also give them three control cases (batches), one randomly selected for each category of user (bullying, aggressive, and spammer).
CrowdFlower Task. As mentioned, we use the CrowdFlower platform to recruit human workers to complete the labeling tasks. We redirect workers to an online survey tool we developed. First, they were asked to complete basic demographic questions, i.e., gender, age, nationality, education level, and annual income. In total, 30% are female and 70% male, while their education level ranges from secondary education (18.4%) to bachelor's degree (35.2%), master's (44%), and PhD (2.4%). One third (35.5%) claimed an income level below €10k, and about 20% between €10k and €20k.
The rest are spread across the €20k-€100k range. About 27% are 18-24 years old, 30% are 25-31, 21% are 32-38, 12% are 39-45, and the rest are above 45. Finally, they come from 56 different countries, with significant participation from the USA, Venezuela, Russia, and Nigeria. Overall, the annotators from the top 10 countries contributed 3/4 of all annotations. We then ask workers to label 10 batches (one of which is a control case).
We also provide them with the user profile description (if any) and the definitions of aggressive, bullying, and spammer behavior. Figure 3 presents an example of the interface. The workers rated the instructions given to them, as well as the overall task, as very good, with a score of 4 out of 5.
Results. We recruited 834 workers, whom we allowed to participate only once, to eliminate behavioral bias across tasks and discourage rushed work. Each batch was labeled by 5 different workers, and a majority vote was used to decide the final label. We discarded the 193 batches for which no majority was reached, leaving us with 1,307 batches (9,484 tweets in total), where 4.5% correspond to bully users, 3.4% to aggressive, 31.8% to spammers, and 60.3% to normal users. Overall, the percentage of abusive users (i.e., bullies and aggressors) is about 8%, similar to observations in existing work, e.g., in [18] 9% of users exhibited bad behavior and in [1] 7% of users cheated; i.e., #GamerGate and associated tweets resulted in a representative sample of aggressive/bullying content. The inter-rater agreement, based on Fleiss' kappa [11], which assesses the reliability of agreement between a fixed number of raters, is 21.89%, a fair agreement between our workers [22]. As noted earlier, we use control cases to assess the credibility of the workers, finding 66.5% overall accuracy (i.e., the percentage of correctly annotated control cases). More specifically, we see 83.75% accuracy for spam, 53.56% for bully, and 61.31% for aggressive control cases.
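As a concrete reference, Fleiss' kappa can be computed as follows from per-batch label counts (5 raters per batch, 4 labels in our case):

```python
def fleiss_kappa(counts):
    """Fleiss' kappa. `counts` is a list of rows, one per rated item,
    each row listing how many raters assigned each label; every row
    must sum to the (fixed) number of raters."""
    n = len(counts)            # number of rated items
    r = sum(counts[0])         # raters per item
    # Per-item observed agreement P_i and per-label proportions p_j.
    p_i = [(sum(c * c for c in row) - r) / (r * (r - 1)) for row in counts]
    p_j = [sum(row[j] for row in counts) / (n * r)
           for j in range(len(counts[0]))]
    p_bar = sum(p_i) / n                 # mean observed agreement
    p_e = sum(p * p for p in p_j)        # expected chance agreement
    return (p_bar - p_e) / (1 - p_e)
```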

Feature Extraction
To perform the machine learning modeling of the user behaviors identified in the dataset, we focus on user-, text-, and network-based features. Next, we detail the features in each category; a summary is shown in Table 1.

Table 1: Features considered in the study.
Text: avg. sentiment score, avg. emotional scores, hate score, avg. word embedding score, avg. curse score
Network (total: 11): # friends, # followers, d = #followers/#friends, hubs, authority, avg. power diff. with mentioned users, clustering coefficient, reciprocity, eigenvector centrality, closeness centrality, louvain modularity

User-based features
Basics. We experimented with various user-based features, in particular, features extracted from a user's profile. These include the number of tweets a user has posted, the age of their account (i.e., the number of days since its creation), the number of lists the user has subscribed to, whether the account is verified (i.e., acknowledged by Twitter as an account linked to a famous user), and whether the user still uses the default profile image (the last two were excluded from the analysis as they did not offer any added value). A representative example is shown in Figure 4a, which plots the CDF of the number of subscribed lists for each of the four behaviors we examine (we note that the maximum number of lists is 4,327, but we trim the plot at 500 for readability). The median (max) number of lists for bullying, spam, aggressive, and normal users is 24 (428), 57 (3,723), 40 (1,075), and 74 (4,327), respectively. We note the difference in list participation across the classes, with normal users signing up to more lists than the other types of users.
Sessions' statistics. Here, we considered the total number of sessions produced by a user from June to August, and estimated several statistics, i.e., the average, median, and standard deviation of the size of each user's sessions. In the end, these features were excluded from the analysis: even though they slightly increased the number of correctly classified bully cases, they had a notable negative effect on detecting the aggressive ones.
Interarrival time. We estimated the inter-arrival time of a user's posts by considering all of their activity from June to August. From Figure 4b, we observe that bullies and aggressors tend to have shorter waiting times between posts than spam and normal users, which is in alignment with the results obtained in [16].

Text-based features
For text-based features, we looked deeper into a user's tweeting activity by analyzing specific attributes that exist in his tweets.
Basics. We consider some basic metrics across a user's tweets, e.g., the number of hashtags used, uppercase text (which can be indicative of an intense emotional state, e.g., 'shouting'), the number of emoticons, and URLs. For each of these, we take the average over all tweets in a user's annotated batch. Figure 4c depicts the CDF of the average number of URLs for the different classes of users. The median value for bully and spam users is 1, for aggressive users 0.9, and for normal users 0.6. An important difference exists in the maximum average number of URLs across the 4 classes: for bully and aggressive users it is 1.17 and 2, respectively, while for spam and normal users it is 2.375 and 1.375. Thus, we see a clear separation in the tendency toward URL posting. Also, from Figure 4d we observe that aggressive and bully users tend to use fewer hashtags within their tweets, which makes sense since their main goal is to attack a person or a specific group rather than to disseminate information.
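These per-batch averages can be sketched as follows; the uppercase-word and URL patterns are simplified heuristics rather than our exact extraction rules:

```python
import re

def text_features(batch):
    """Average per-tweet counts of hashtags, URLs, and uppercase words
    over the tweets in an annotated batch."""
    feats = {"hashtags": 0, "urls": 0, "upper_words": 0}
    for t in batch:
        feats["hashtags"] += len(re.findall(r"#\w+", t))
        feats["urls"] += len(re.findall(r"https?://\S+", t))
        feats["upper_words"] += sum(1 for w in t.split()
                                    if len(w) > 1 and w.isupper())
    return {k: v / len(batch) for k, v in feats.items()}
```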
Word embedding. Word embeddings capture both semantic and syntactic relations between words, which permits capturing more refined attributes and contextual cues inherent in human language, e.g., people often use irony to express aggressiveness or repulsion. Therefore, we considered Word2Vec [24], an unsupervised word embedding-based approach for detecting semantic and syntactic word relations. Word2Vec is a two-layer neural network that operates on a set of texts to: 1) establish a vocabulary from the words that appear in the set more often than a user-defined threshold (to eliminate noise), 2) apply a learning model to the input texts to learn the words' vector representations in a D-dimensional space, and 3) output a vector representation for each word encountered in the input texts. D is user-defined; based on [24], 50-300 dimensions can model hundreds of millions of words with high accuracy. Two methods can be used to build the actual model: CBOW (Continuous Bag of Words), which uses context to predict a target word, and Skip-gram, which uses a word to predict a target context. Skip-gram works well with small amounts of training data and represents even rare words or phrases well, while CBOW shows slightly better accuracy for frequent words and is faster to train. To train Word2Vec, either large textual corpora are used (e.g., Wikipedia articles in a selected language), or more thematic textual collections that better embed word usage in the targeted domain. Here, we use Word2Vec to generate features that better capture the context of the data at hand. We used a pre-trained model with large-scale thematic coverage (with 300 dimensions) and applied the CBOW model due to its better training time.
Finally, having at hand the vector representations of all words in an input text, the overall vector representation of the text is derived by averaging the vectors of all its comprising words. Overall, the word embedding features contributed little to distinguishing bullies and aggressors from normal users, which is in accordance with [28], which attempted to detect abusive language in user comments.
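The averaging step can be sketched as follows; the tiny two-dimensional `embeddings` dict in the example stands in for the 300-dimensional pre-trained Word2Vec model:

```python
def tweet_vector(tokens, embeddings, dim):
    """Mean of the word vectors of a tweet's in-vocabulary tokens;
    out-of-vocabulary tokens are skipped."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * dim       # no known words: zero vector
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
```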
Sentiment. Sentiment has been considered a main attribute for detecting offensive or abusive behavior in communications between individuals, as in [26]. To detect sentiment, we used the SentiStrength tool, which estimates the positive and negative sentiment (on a [-4, 4] scale) in short texts, even for the informal language often used on Twitter. First, however, we evaluated its performance by applying it to an already annotated dataset containing a total of 7,086 tweets. The overall accuracy was 92%, attesting to its efficacy for our purposes. Figure 4e plots the CDF of average sentiment for the 4 user classes. Especially in the aggressive case, there is a clear distinction in sentiment, indicating that it is quite useful in distinguishing aggressors from the other classes. We also attempted to detect more concrete emotions, i.e., anger, disgust, fear, joy, sadness, and surprise, but they were not used in the analysis as they did not help distinguish among the different user categories.
Hate and curse words. Additionally, we wanted to specifically examine the presence of hate speech and curse words in users' posted texts. For this purpose, we used the Hatebase database, a crowdsourced list of hate words. Each word in the Hatebase database is additionally scored on a [0, 100] scale indicating how hateful it is. Finally, a list of swear words (from noswearing.com/dictionary) was also used in a binary fashion, i.e., we set a variable to true if a tweet contained any word in the list, and false otherwise. Even though these lists can be useful for categorizing general text as hateful or aggressive, we found they are not well suited to classifying tweets, which are short and typically include modified words, URLs, and emoticons. We did find that bully and aggressive users have a minor bias towards using such words, but we omit these results due to space constraints.
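A sketch of these lexicon features follows; the word lists passed in are hypothetical stand-ins for the Hatebase and noswearing.com lists:

```python
def hate_features(tokens, hate_scores, swear_words):
    """Average Hatebase-style score ([0, 100] per hateful word) over a
    tweet's tokens, plus a binary curse-word flag."""
    scores = [hate_scores[t] for t in tokens if t in hate_scores]
    avg_hate = sum(scores) / len(scores) if scores else 0.0
    has_curse = any(t in swear_words for t in tokens)
    return avg_hate, has_curse
```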

Network-based features
The social network of Twitter plays a crucial role in the diffusion of useful information and ideas, but also of negative opinions, rumors and abusive language (e.g., [17,29]). We study the association between aggressive or cyberbullying behavior and the position of users in the Twitter network of friends and followers. The network comprises about 1.2M users and 1.9M friend (i.e., someone who is followed by a user X) or follower (i.e., someone who follows a user X) edges, with an effective diameter of 4.934, an average clustering coefficient of 0.0425, and 24.95% and 99.99% of nodes in the weakest and largest components, respectively. Users in such a network can have a varying degree of embeddedness with respect to friends or followers, reciprocity of connections, connectivity with different parts of the network, etc.
Popularity. The popularity of a user can be defined in different ways: for example, by the number of friends or followers (out- or in-degree centrality), or by the ratio of the two (since Twitter allows users to follow anyone without their approval, the followers-to-friends ratio can quantify a user's popularity). These measures quantify the opportunity of a user to have a positive or negative impact on his ego-network in a direct way. Figures 5a and 5b indicate that bullies have fewer friends and followers than the other user categories, with normal users having the most friends.
Reciprocity. This metric quantifies the extent to which users reciprocate the follower connections they receive from other users. The average reciprocity in our network is 0.2. In Figure 5c we show that the user classes considered have different distributions, with the bully and aggressive users more similar to each other (i.e., a higher degree of reciprocity) than to the normal or spam users. Reciprocity has also been used as a feature in [15], but on an interaction-based graph using likes on posts.
Here, we investigate the fundamental reciprocity of the Twitter friendship network, and we are the first to do so in the context of bullying.
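On a directed follower graph, per-user and overall reciprocity can be sketched with NetworkX as follows (a four-node toy graph stands in for the paper's ~1.2M-user network):

```python
import networkx as nx

# toy directed follower graph: an edge u -> v means "u follows v"
G = nx.DiGraph([("a", "b"), ("b", "a"), ("a", "c"), ("d", "a")])

def user_reciprocity(G, u):
    """Fraction of u's followers that u follows back."""
    followers = set(G.predecessors(u))
    if not followers:
        return 0.0
    return len(followers & set(G.successors(u))) / len(followers)

r_a = user_reciprocity(G, "a")     # "a" is followed by b and d, follows back only b
overall = nx.overall_reciprocity(G)  # network-wide fraction of reciprocated edges
```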
Power Difference. A recent study [30] found that the emotional and behavioral state of victims depends on the power of their bullies, e.g., more negative emotional experiences were observed when more popular cyberbully users conducted the attack, and a high power difference with respect to status in the network has been shown to be a significant characteristic of bullies [7]. Therefore, we consider a more elaborate feature: the power difference between a tweeter and his mentions. In fact, further analysis of the users a user mentions could reveal possible victims or bystanders of his aggressive or bullying behavior. To this end, we compute the difference in the followers/friends ratio between a user and each of the users he mentions in his posts. Figure 5d shows the distribution of this power difference between a tweeter and his mentions (we note that the maximum power difference is 20, but we trim the plot for readability). We also investigate the users' position in their network by considering more elaborate measures that capture the influence of a user in his immediate and extended neighborhood, as well as his connectivity. In particular, we study hub and authority centrality, as well as eigenvector and closeness centrality.
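A sketch of this feature, using the followers-to-friends ratio as the popularity proxy and capping the difference at 20, mirroring the trimmed plot; the +1 in the denominator, which keeps the ratio defined for users who follow no one, is our assumption.

```python
def ratio(followers, friends):
    """Followers-to-friends ratio as a simple popularity proxy;
    +1 guards against division by zero (our convention)."""
    return followers / (friends + 1)

def power_difference(user, mentioned, cap=20.0):
    """Popularity gap between a tweeter and a mentioned user,
    capped symmetrically at +/- cap."""
    diff = ratio(*user) - ratio(*mentioned)
    return max(min(diff, cap), -cap)

# (followers, friends): a popular account mentioning a small one
d = power_difference((9999, 99), (10, 9))
```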
Hubs and Authority. A node's hub score is the sum of the authority scores of the nodes it points to, while its authority score is the sum of the hub scores of the nodes that point to it; intuitively, the authority score shows how many different hubs a user is connected with [21].
Influence. Eigenvector centrality measures the influence of a user in his network, immediate or extended over multiple hops. Closeness centrality measures the extent to which a user is close to every other user in the network. To calculate the last four measures, we considered both the follower and friend relations of the users under examination, in an undirected version of the network. Figures 6a, 6b, and 6c show the CDFs of the hub (max value: 0.861), authority (max value: 0.377) and eigenvector (max value: 0.0041) scores for the four classes of users. We observe that bully users tend to have lower hub and authority scores, indicating that they are not particularly popular within their networks.
In terms of influence on their ego and extended networks, they exhibit behavior similar to spammers, while aggressors seem to have influence closer to that of normal users. Closeness centrality was excluded from the analysis since it did not contribute significantly to distinguishing among the four user classes (we also omit its CDF due to space).
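These centralities can be computed with NetworkX, shown here on a toy undirected graph; note that HITS is defined on directed graphs, so the undirected network is symmetrized first.

```python
import networkx as nx

# toy undirected follower/friend network ("w" is the most central node)
G = nx.Graph([("u", "v"), ("u", "w"), ("v", "w"), ("w", "x")])

eig = nx.eigenvector_centrality(G)    # influence over multiple hops
hubs, auths = nx.hits(nx.DiGraph(G))  # HITS on the symmetrized graph
close = nx.closeness_centrality(G)    # proximity to all other users
```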
Communities. Previous work [14] highlighted that bullies tend to experience social rejection from their environment and thus face difficulties in developing social relations. We examined the usefulness of this attribute by calculating the clustering coefficient, which captures a user's tendency to cluster with others in his network. Figure 6d plots the CDF of the clustering coefficient for the four behavioral classes. We observe that bully users, similar to spammers, are less prone to form clusters than aggressive and normal users. Finally, we computed communities using the Louvain method [2], which is suitable for identifying groups in large networks, as it attempts to optimize the modularity (how densely connected the nodes within a cluster are) of a network by moving nodes from one cluster to another. Overall, we observed a few communities with a high number of nodes (especially in the network core), resulting in a feature with very low distinguishing power; thus, it was not considered in the classification.
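Both measures can be sketched with NetworkX; since the paper uses the Louvain method, the greedy modularity optimizer below serves as a stand-in that likewise optimizes modularity (NetworkX ≥ 2.7 also ships `louvain_communities`).

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# toy network: two triangles joined by a single bridge edge (3, 4)
G = nx.Graph([(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6)])

cc = nx.clustering(G)                   # per-user clustering coefficient
comms = greedy_modularity_communities(G)  # modularity-based communities
```

Nodes whose neighborhoods form triangles get a clustering coefficient of 1, the bridge endpoints score lower, and the optimizer recovers the two triangles as separate communities.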

Modeling Aggression & Bullying
In this section we discuss the effort to model bullying and aggression behaviors on Twitter, using the features extracted and the labels provided by the crowdsourcing survey. We examined the performance of several supervised classification algorithms using all and subsets of the labels provided.

Experimental Setup
We considered various machine learning algorithms: probabilistic, tree-based, and ensemble classifiers (built upon a set of classifiers whose individual decisions are combined to classify new data). Due to space limitations, here we present the best results with respect to training time and performance, which were obtained with the Random Forest tree-based classifier. For all the experiments presented next, we used the WEKA data mining toolkit and repeated (10 times) 10-fold cross validation [19]. Also, we did not balance the data, in order to remain consistent with the class distribution observed in real-life behavior.
Tree-based classifiers. Tree-based classifiers are relatively fast compared to other classification models [31]. They consist of three types of nodes: (i) the root node, with no incoming edges, (ii) internal nodes, with one incoming edge and two or more outgoing edges, and (iii) leaf nodes, with one incoming edge and no outgoing edges. The root and each internal node correspond to feature test conditions (in the simplest form, each test involves a single feature) for separating data based on their characteristics, while the leaf nodes correspond to the available classes.
We experimented with various tree-based classifiers: J48, LADTree, LMT, NBTree, Random Forest (RF), and Functional Tree. We achieved the best performance with the RF classifier, which constructs a forest of decision trees with random subsets of features during the classification process.
Evaluation. For evaluation purposes, we examined standard machine learning performance metrics for the output model: precision (prec), recall (rec), and weighted area under the ROC curve (ROC), at the class level and overall average across classes. Also, the overall kappa, root mean squared error (RMSE) and accuracy values are presented.
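The paper's pipeline uses WEKA; as an equivalent sketch, repeated 10-fold cross-validation of a Random Forest with weighted AUCROC scoring can be expressed in scikit-learn on synthetic stand-in data (the class weights below merely mimic the dataset's imbalance and are not the real feature matrix).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# synthetic stand-in for the real features: 4 imbalanced classes,
# roughly mimicking the bully/aggressive/spam/normal skew
X, y = make_classification(n_samples=400, n_features=20, n_informative=8,
                           n_classes=4, weights=[0.05, 0.05, 0.3, 0.6],
                           random_state=42)

clf = RandomForestClassifier(n_estimators=50, random_state=42)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=42)
scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc_ovr_weighted")
mean_auc = scores.mean()  # weighted AUCROC averaged over the 100 folds
```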
Experimentation phases. Two experimental setups were tested to assess the feasibility of detecting user behavior: (i) 4-class classification, i.e., bully, aggressive, spam and normal users, and (ii) 3-class classification, i.e., bully, aggressive and normal users. The latter setup examines the case where spam is filtered out with a more elaborate technique and we attempt to distinguish bullies and aggressors from normal users.

Classification results
Detecting offensive classes. Here, we examine whether it is possible to distinguish between bully, aggressive, spam and normal users. Table 2a overviews the results obtained with the Random Forest classifier. In more detail, we observe that the classifier succeeds in detecting 43.2% (std. 0.042) of the bully cases, which is quite satisfactory given the small number of bully cases identified to begin with (only 4.5% of our dataset). In the aggressive case, we observe that recall is quite low, at 11.8% (std. 0.078). Based on the confusion matrix (omitted due to space limits), the misclassified cases mostly fall into either the normal or bullying classes, which is in alignment with the human annotations gathered during the crowdsourcing phase. Overall, the average precision is 71.6% and the recall is 73.32%, while the accuracy is 73.45%, with 47.17% kappa and 30.86% RMSE.
Classifying after spam removal. In this experimental phase, we wanted to explore whether the distinction between bully/aggressive and normal users would be more evident after applying a more sophisticated spam removal process in the preprocessing step. To this end, we removed from our dataset all the spam-related cases that the annotators had identified and re-ran the classification with the RF classifier. Considering the AUC of 0.907, we believe that with more sophisticated spam detection applied to an incoming stream of tweets, our features and classification techniques can perform quite well at detecting bullies and aggressors and distinguishing them from typical Twitter users.
Features evaluation. Table 3 shows the top 12 features for each experimental setup (based on the information gain).
Overall, in both experiments the most contributing features tend to be the user- and network-based ones, which show how active and well-connected a user is within his network.
Balancing data. Based on [4], like almost all classifiers, Random Forest handles extremely imbalanced training datasets (as in our case) poorly, resulting in a bias towards the majority classes. To address this issue, we simultaneously applied over-sampling (based on SMOTE, which creates synthetic instances of the minority class) and under-sampling (a resampling technique without replacement), as this combination has been shown to yield better performance [3]. Here, we focus on the 3-class experimental setup (i.e., without the spam user class): after randomly splitting the data into 90% training and 10% testing sets, we balanced the training set. The resulting distribution was 349, 386 and 340 instances for the bully, aggressive and normal classes, respectively, while the test set was not processed further. Table 4 shows the obtained results. After balancing the data, the classifier succeeds in detecting 66.7% and 40% of the bully and aggressive cases, respectively, while the overall accuracy is 91.25%, with 59.65% kappa and 14.23% RMSE.
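The core interpolation idea behind SMOTE can be sketched in a few lines of NumPy; a real pipeline would instead use imbalanced-learn's `SMOTE` together with random under-sampling, so the function below is purely illustrative.

```python
import numpy as np

def smote_like(X, n_new, k=3, rng=None):
    """Minimal sketch of SMOTE's core idea: synthesize minority-class
    samples by interpolating between a random point and one of its
    k nearest minority-class neighbors."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        d = np.linalg.norm(X - X[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]      # k nearest, skipping the point itself
        j = rng.choice(nbrs)
        lam = rng.random()                 # random interpolation weight in [0, 1)
        out.append(X[i] + lam * (X[j] - X[i]))
    return np.vstack(out)

# hypothetical 2-D minority-class samples at the unit-square corners
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synth = smote_like(minority, n_new=6, k=2, rng=0)
```

Each synthetic point lies on a segment between two existing minority samples, which is what keeps the oversampled class from being mere duplicates.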

Discussion & Conclusion
While the digital revolution has brought immense advances in communication and social interaction, it has also enabled a wider proliferation of harmful behavior. Unfortunately, effective tools for detecting such harmful actions are lacking, as this type of behavior is often ambiguous in nature and/or exhibited via seemingly superficial comments and criticisms. To this end, this paper presented a novel system geared to automatically classify two kinds of harmful online behavior, namely cyberaggression and cyberbullying, focusing on the Twitter social network. We relied on crowdsourced workers to label 1.5k users as normal, spammers, aggressive, or bullies from a corpus of almost 10k tweets, using an efficient, streamlined labeling process. We investigated 3 types of attributes - user-, text-, and network-based - that characterize such behaviors, with a total of 30 features. We found that bully users are less popular (fewer followers/friends, lower hub, authority and eigenvector scores) and do not participate in many communities. Even though they are not very active w.r.t. number of posts, when they do post, they post more frequently than others, and do so with fewer hashtags, URLs, etc. Moreover, they tend to have a long activity history on Twitter, based on the age of their accounts.
Interestingly, aggressive users show behavior similar to spammers, e.g., in terms of number of followers, friends, and hub scores. Like bullies, they also do not post a lot of tweets, but they exhibit a small response time between postings, and use few hashtags and URLs in their tweets. Also, such users tend to have been on Twitter for a long time, like bullies. However, their posts seem to be more negative in sentiment than those of bullies or normal users. On the other hand, normal users are quite popular with respect to numbers of followers, friends, hubs, and authorities. They participate in many topical lists, and use a large number of hashtags and URLs.
These observations are in line with the intuition that bully and aggressive users tend to attack, in rapid fashion, particular users or groups they target, and do so in short bursts, with not enough duration or content to be detected by Twitter's systems. In general, we find that aggressive users are more difficult to characterize and identify using a machine learning classifier than bullies, since sometimes they behave like bullies, but other times as normal or spam users.
Finally, we showed that our methodology for data analysis, labeling, and classification can scale up to millions of tweets, while our machine learning model built with a Random Forest classifier can distinguish between normal, aggressive, and cyberbullying users with high accuracy, i.e., 91.08%. In fact, while prior work almost exclusively focused on user- and text-based features (e.g., linguistics, sentiment, membership duration), we performed a thorough analysis of network-based features, and found them to be very useful for this user modeling task. We found such features to be the most effective for classifying aggressive user behavior (half of the top-12 features in classification power are network-based), followed by user-based features, while being easy to extract with respect to computation and analysis time. Surprisingly, text-based features do not contribute as much to the detection of aggression (with the exception of tweet characteristics such as number of URLs, hashtags, and sentiment).