Selecting User Influence on Twitter Data Using Skyline Query under MapReduce Framework

The aim of this research was to identify and select user influence on Twitter data. In the identification stage, the proposed method combined the matrix Twitter approach, sentiment analysis, and characterization of the opinion leader. The important characteristics were external communication, accessibility, and innovation. Based on these characteristics and on information derived from Twitter data through matrix Twitter and sentiment analysis, a skyline query algorithm was constructed for the selection stage. The skyline query algorithm selected user influence by comparing each user with the other users on the value of every characteristic. A user influence was thus a user that was not dominated by any other user in any combination of skyline objects. Using the MapReduce framework in the identification and selection stages supported the whole operation, since Twitter data are large and change rapidly. The results showed that the MapReduce framework reduced the execution time, while the parallel skyline query revealed the user influence in the data.


Introduction
Twitter is a popular medium in Indonesia that is widely used for sharing opinion and propaganda. Prompted by the question "What's happening?", users deliver their opinion and propaganda in messages of at most 140 characters, called tweets. Propaganda-related tweets are commonly found among voters in public elections [1] and in product marketing [2]. In a public election, propaganda is addressed to national figures or prospective leaders, while in marketing it is related to product branding. The use of Twitter to spread propaganda is called viral marketing [3].
One of the concepts used in viral marketing is user influence [4]: a Twitter user who is capable of influencing the decisions of other users. Matrix Twitter (MT) is one of the methods used to measure this impact. MT is a set of mathematical symbols that express information about a Twitter user as numerical values, such as the number of followers, following, replies, and tweets [5]. These values are combined in a formula or algorithm, as in the use of the PageRank algorithm to select user influence [6]. MT was used to identify user influence in [7], based on the characteristics of the opinion leader and a machine learning algorithm. However, that study had a weakness: its analysis of user influence was based on MT alone. MT only measures users by their profile information and their relation to other users. Septiandri and Purwarianti (2013) stated that the influence level of a user may also result from the user's tweets: the sentiment of the replies to a tweet demonstrates the influence of the user [8].
Furthermore, a machine learning algorithm depends heavily on the data used for training and evaluation, so the model must remain consistent across different data sets. Different data may cause inaccuracy in identifying user influence. In [9], a skyline query was used to identify and select users on Facebook. A skyline query is an algorithm that obtains the objects that are not dominated by other objects [10]. Such an object is called a skyline. A user is considered a user influence when the user is not dominated by other users; these users show better values on particular characteristics. Hence this research used MT features and tweet content analysis to select user influence using a skyline query based on the characteristics of the opinion leader (OL).

An OL is an individual who is able to influence the decision making of others in a community [11]. User influence on Twitter is similar to OL [12]. According to [13], an OL can be recognized by characteristics including external communication, innovation, accessibility, and socio-economic status. Unlike a common skyline query, selecting user influence with OL characteristics is not simple, because the characteristics are not directly available in Twitter data. Moreover, as Twitter data grow, they can no longer be processed with a simple approach on a single computer. Running MapReduce on a cluster of computers is a solution for determining user influence. MapReduce is a programming model and software framework for large data [14], and it can process data distributed across the cluster. Based on this description, the study aimed to (1) identify OL-based user influence from MT and tweet content analysis in the MapReduce framework, and (2) apply a skyline query algorithm in the MapReduce framework to select user influence.

Research Method
The method consists of three stages: 1) pre-processing, 2) identification of user influence, and 3) selection of user influence. Topic-based data collection and sentiment analysis are carried out in pre-processing, while the selection of user influence comprises the skyline process and the top-k query. Identification and selection of user influence are processed in the MapReduce framework. The research flowchart is presented in Figure 1.

Data collection based on topics
This step identifies named characters in order to obtain the national figures used as topics and keywords for data collection. It applies contextual and morphological rules to article data from online news media, such as detik.com, and from the channels of political parties. The article data were obtained through Really Simple Syndication (RSS). Contextual and morphological rules were chosen because they identify characters more accurately than association rule mining [15]. The characters obtained are then used to determine keywords, and the tweets were collected using the Twitter4j library. In this study, only opinion tweets are passed to the next stage, so normalization and opinion selection are performed. Part-of-speech tagging with a Hidden Markov Model (HMM) algorithm and opinion sentence rules are used to filter the opinions [16]. This step results in tweets containing opinions that comment on national figures.

Sentiment analysis
This step aims to obtain a sentiment classification model, which is used to determine the sentiment of the replies received by a user influence. The step employs a hybrid method, a combination of a machine learning algorithm and sentiment dictionaries (lexicons), to build the classification model [17]. The hybrid method consists of several processes: corpus formation, normalization, feature reduction, dictionary standardization, feature extraction, and classification. Corpus formation follows [18][19], which use emoticon characters. Normalization repairs invalid data in tweets [20]. Feature reduction removes unused words and characters such as punctuation and stop words. Standardization of the sentiment dictionaries consists of translation, validation, and standardization. Translation is performed with the TextBlob library on the Bing Liu [21], SentiStrength Emoticons [22], SentiWords [23], and MPQA Subjectivity (MPQA-S) [24] dictionaries. Validation uses Kamus Besar Bahasa Indonesia (KBBI) to make sure that the translated output has a meaning in the Indonesian language. Furthermore, each dictionary has a different value range, so the values are standardized to lie between 0 and 1 [25].
The standardized dictionary is used in feature extraction, which determines the sentiment feature vector used in classification. The features are adapted from [26] with modification: the number of positive, negative, and neutral sentiment words; the number of sentiment words; the sentiment value of the words; the maximum values of positive and negative sentiment; and the number of negation words in the tweet. This vector and a machine learning algorithm, Naive Bayes Classification (NBC), are used to produce the classification model. NBC provides high accuracy in the classification of text documents [27].
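As a rough illustration of the feature vector listed above, the following sketch counts sentiment-bearing words in a tweet against a standardized lexicon. The lexicon entries, the negation list, the whitespace tokenizer, the reading of "sentiment value of the words" as positive minus negative strength, and the function name `extract_features` are all assumptions made for illustration; they are not taken from the study.

```python
from typing import Dict, List, Tuple

# Hypothetical standardized lexicon: word -> (polarity, strength in [0, 1]).
LEXICON: Dict[str, Tuple[str, float]] = {
    "bagus": ("positive", 0.8),
    "hebat": ("positive", 1.0),
    "buruk": ("negative", 0.7),
    "biasa": ("neutral", 0.0),
}
# Hypothetical negation words.
NEGATIONS = {"tidak", "bukan"}


def extract_features(tweet: str) -> List[float]:
    """Build an 8-element sentiment feature vector for one tweet."""
    tokens = tweet.lower().split()
    hits = [LEXICON[t] for t in tokens if t in LEXICON]
    pos = [s for p, s in hits if p == "positive"]
    neg = [s for p, s in hits if p == "negative"]
    neu = [s for p, s in hits if p == "neutral"]
    return [
        float(len(pos)),                 # number of positive words
        float(len(neg)),                 # number of negative words
        float(len(neu)),                 # number of neutral words
        float(len(hits)),                # number of sentiment words
        float(sum(pos) - sum(neg)),      # net sentiment value (assumed form)
        max(pos, default=0.0),           # maximum positive strength
        max(neg, default=0.0),           # maximum negative strength
        float(sum(t in NEGATIONS for t in tokens)),  # number of negation words
    ]
```

A vector of this kind, paired with the emoticon-derived label of each corpus tweet, is what a Naive Bayes classifier would be trained on.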

Identification of user influence
Identification of user influence is based on the specification and characteristics of OL, MT, and the sentiment classification model. Therefore, the data attributes, the MT calculation, and the sentiment weight of each user are analyzed. The process is analyzed and modelled as a BPMN workflow, an understandable notation for analyzing, modelling, and representing a process flow, drawn in Sybase Power Designer 16.0. The implementation uses the Java programming language and Hadoop. The OL characteristics used are external communication, accessibility, and innovation adaptation. Socio-economic status is discarded since, according to [7], Twitter users have good socio-economic status.
a. Analysis of data attributes. Attribute analysis determines, through data observation, the data attributes needed to calculate the MT values.
b. Calculation of MT values. The calculation is based on MT [5]. This step is needed because some MT values cannot be obtained directly from Twitter data. The output of this process is each Twitter user with its MT values.
c. Calculation of sentiment weight. The sentiment weight is calculated with the model from the sentiment analysis. The model is applied only to tweets that are replies, because the sentiment of the replies to a tweet shows the user's influence [8].

Selecting user influence
Skyline query
A skyline query is a method for selecting the objects in the data that are not dominated by other objects. These objects form the skyline. Skyline algorithms include Block Nested Loops (BNL) [28], Sort Filter Skyline (SFS) [29], Nearest Neighbor Skyline (NN) [30], Bitmap, Index, and Branch-and-Bound Skyline [31]. The SFS algorithm is a development of the BNL algorithm.
Based on the values from the identification stage, a user is a skyline if and only if the user is not dominated by any other user. User i is dominated by user j when j is at least as good as i on every OL characteristic and strictly better on at least one. Therefore, a user influence is a user that is not dominated by other users on the OL characteristics. The skyline query in the current study uses the Sort Filter Skyline (SFS) algorithm in the MapReduce framework [32].
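The dominance test and the sort-then-filter idea behind SFS can be sketched as follows. This is a minimal single-machine illustration, not the implementation used in the study; it assumes that a larger value is better on all three characteristics and uses the attribute sum as the presorting score. The names `dominates` and `sfs_skyline` are hypothetical.

```python
from typing import Dict, List, Tuple

# (external communication, accessibility, innovation); larger is assumed better.
Point = Tuple[float, float, float]


def dominates(a: Point, b: Point) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def sfs_skyline(users: Dict[str, Point]) -> List[str]:
    """Return the names of the users that no other user dominates."""
    # Presort by a monotone score. If a dominates b then sum(a) > sum(b),
    # so every possible dominator of a user is visited before that user.
    ordered = sorted(users, key=lambda u: sum(users[u]), reverse=True)
    skyline: List[str] = []
    for u in ordered:
        # Thanks to the presort, a user only has to be checked against the
        # skyline found so far, and accepted users are never evicted later.
        if not any(dominates(users[s], users[u]) for s in skyline):
            skyline.append(u)
    return skyline
```

The presort is what distinguishes SFS from BNL: a point admitted to the skyline window is final, so the window never needs to be pruned.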
SFS in the MapReduce framework consists of two steps: determination of the local skyline and determination of the global skyline. The local skyline is determined by first splitting the data into parts and computing the skyline of each part with the SFS algorithm. The skylines obtained from the parts are then merged and filtered to obtain the global skyline. The partitioning uses the technique employed in Nearest Neighbor and Divide & Conquer: the data are divided into $2^d$ parts based on the median value of each data attribute, where d is the number of data attributes. Each part is identified by a d-bit code, which is used as the key in the key-value pairs of the map and reduce functions. The result of the local skyline determination is the user influence (skyline) of each part. These users are then used in the determination of the global skyline to obtain the user influence of the whole data.
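The two-step scheme can be mimicked on one machine to show how the d-bit key drives the partitioning. The sketch below is only a simulation of the map and reduce phases in plain Python, not Hadoop code; it assumes that larger values are better, and the function names and the choice of assigning a value equal to the median to bit 1 are illustrative assumptions.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, List, Tuple

Point = Tuple[float, ...]


def dominates(a: Point, b: Point) -> bool:
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def skyline(points: Dict[str, Point]) -> Dict[str, Point]:
    """Sort-filter skyline of one partition (larger values assumed better)."""
    result: Dict[str, Point] = {}
    for u in sorted(points, key=lambda name: sum(points[name]), reverse=True):
        if not any(dominates(p, points[u]) for p in result.values()):
            result[u] = points[u]
    return result


def partition_key(p: Point, medians: List[float]) -> str:
    """d-bit key: bit i is 1 when attribute i is at or above its median."""
    return "".join("1" if v >= m else "0" for v, m in zip(p, medians))


def mapreduce_style_skyline(users: Dict[str, Point]) -> List[str]:
    if not users:
        return []
    d = len(next(iter(users.values())))
    medians = [median(p[i] for p in users.values()) for i in range(d)]

    # "Map": emit (d-bit key, user) so that the data fall into at most 2^d parts.
    parts: Dict[str, Dict[str, Point]] = defaultdict(dict)
    for name, p in users.items():
        parts[partition_key(p, medians)][name] = p

    # "Reduce": local skyline of every part.
    local: Dict[str, Point] = {}
    for part in parts.values():
        local.update(skyline(part))

    # Merge and filter the local skylines to obtain the global skyline.
    return sorted(skyline(local))
```

Filtering the merged local skylines is sufficient because every global skyline user is also a skyline user of its own part.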

Top-k query
The top-k query is a ranking method for user influence, where k is a positive integer that defines the expected number of most influential users. The users obtained from the skyline query are ranked to find the most influential ones. The process uses the top-k query of [33], because the algorithm does not depend on a value function over the data and is effective in determining the best objects. Additionally, the algorithm can be processed in the MapReduce framework. The steps of the top-k algorithm in the MapReduce framework are as follows.
a. The data from the skyline query are split according to the m data attributes. The value of m in this study was 3.
b. An (m+1)-th attribute is created whose value is the sum of the three previous attributes.
c. In the map process, the data in each part are sorted in ascending order and then ranked.
d. The output of the map process is a pair of user and ranking value (r) for each part, which serves as a temporary key-value pair.
e. The combiner step processes the pairs of user and r from the map process and groups the data by user.
f. The output of the previous step is the set of pairs of user and r that share the same user. This output is used in the reduce step.
g. In the reduce step, the sum of r (sum-r) is computed for each user group.
h. The output of the reduce step is a pair of user and sum-r.
i. The users are ranked in ascending order of sum-r. A user with a low sum-r is a user influence in the data.
j. The pairs of user and sum-r are saved in accordance with the value of k.
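Steps a to j can be illustrated with a small single-machine simulation of the map, combine, and reduce phases. The sketch assumes that rank 1 is given to the largest value in each attribute list, so that a low sum-r marks an influential user, and that ties are broken by user name; both choices and the function name `top_k` are assumptions, not details taken from [33].

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def top_k(users: Dict[str, Tuple[float, ...]], k: int) -> List[Tuple[str, int]]:
    """Rank-sum top-k over m attributes plus their sum."""
    if not users:
        return []
    m = len(next(iter(users.values())))

    # Steps a-b: one column per attribute, plus an (m+1)-th column with the sum.
    columns = [{u: p[i] for u, p in users.items()} for i in range(m)]
    columns.append({u: sum(p) for u, p in users.items()})

    # Steps c-d ("map"): rank the users inside each column and emit (user, r).
    emitted: List[Tuple[str, int]] = []
    for col in columns:
        ordered = sorted(col, key=lambda u: (-col[u], u))  # rank 1 = largest value
        emitted.extend((u, r) for r, u in enumerate(ordered, start=1))

    # Steps e-f ("combine"): group the ranks by user.
    grouped: Dict[str, List[int]] = defaultdict(list)
    for u, r in emitted:
        grouped[u].append(r)

    # Steps g-h ("reduce"): sum-r per user.
    sum_r = {u: sum(ranks) for u, ranks in grouped.items()}

    # Steps i-j: ascending sum-r, keep the first k pairs.
    return sorted(sum_r.items(), key=lambda kv: (kv[1], kv[0]))[:k]
```

Because only ranks are summed, the result does not depend on the scale of the individual attributes, which matches the stated reason for choosing this algorithm.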

Results and Analysis
Data
This work focuses on Indonesian national figures. The figures selected as topics for data collection are Joko Widodo (jokowi), Jusuf Kalla, Ridwan Kamil (ridwan kamil), and Basuki Tjahaja Purnama (Ahok). These names were obtained from the identification process in January 2016. Based on these names, this study created two groups of data containing opinions on the topics: corpus data and selection data.
First, the corpus data are used to create the sentiment classification model. They contain tweets collected in April 2016. After filtering, the corpus holds 1281 tweets, consisting of 891 positive, 289 negative, and 101 neutral tweets. Second, the selection data are used for the identification and selection of user influence. These data were also collected in April 2016. Spam and bot accounts in the data are identified using the Twitter follower-following ratio (TFF) so that they can be discarded. The TFF result is shown in Figure 2.

Sentiment analysis model
Sentiment analysis with the hybrid approach on the corpus data achieves an accuracy of 75.57% under 10-fold cross-validation. The model was therefore applied to classify the selection data and determine the sentiment weight of each user. Table 1 shows the number of tweets and users in the selection data by sentiment group. In the selection data, 11,326 tweets are replies.

Characteristics value of user influence
The BPMN for identifying user influence is presented in Figure 3. Identification starts by analyzing the data attributes of Twitter, calculating the MT values, and determining the sentiment weight of each user. Based on data observation, Twitter data consist of three groups of attributes: attributes related to user activity, user profile, and tweet content.

Figure 3. BPMN for identification of user influence
User activity attributes are connected with the activity of a user on Twitter. User profile attributes contain information about the user and the user's relation to other users, including the name and the numbers of followers and following. Tweet content attributes describe information in a tweet, such as user_mention, hashtag, and symbols. Table 2 lists the results of the data attribute analysis that are usable for calculating the MT values. The calculation is performed for each user in the selection data using the attributes in Table 2. For every user whose RP3 value is greater than zero (RP3 > 0), the sentiment weight is computed using Equation 1. The data with MT values and sentiment weights are then used to derive the value of each characteristic of user influence.
$$BS_u = \sum \mathit{Sentiment}_{u-r} \qquad (1)$$
$BS_u$ is the sentiment weight of user u, that is, the sum of the reply sentiment values ($\mathit{Sentiment}_{u-r}$) on the tweets of user u in the data. The sentiment value is 2 for a positive reply, -2 for a negative reply, and 0 for a neutral reply.
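Equation 1 amounts to a per-user sum over classified replies. A minimal sketch follows, assuming each reply has already been labelled by the sentiment model and that neutral replies score 0; the input layout of (replied-to user, label) pairs and the function name `sentiment_weights` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Reply sentiment values from the text: +2 positive, -2 negative, 0 neutral.
SENTIMENT_VALUE = {"positive": 2, "negative": -2, "neutral": 0}


def sentiment_weights(replies: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Sum the reply sentiment values per replied-to user (BS_u)."""
    weights: Dict[str, int] = defaultdict(int)
    for user, label in replies:
        weights[user] += SENTIMENT_VALUE[label]
    return dict(weights)
```

For example, a user who receives two positive replies and one negative reply obtains a sentiment weight of 2 + 2 - 2 = 2.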

External Communication
External communication indicates popularity. On Twitter, popularity can be observed from F1 and F3 of MT, the numbers of followers and following. A user may have the most followers not because of the quality of its tweets but because it is socially active and consequently has many friends and followers. It is also possible that a user has the most followers because it shares useful information, which leads other users to follow it. Therefore, the value of external communication is calculated using Equation 2 [34]. A user whose external communication value is close to 1 is the best user on this characteristic.
$$\mathit{FollowerRank} = \frac{F1}{F1 + F3} \qquad (2)$$

Accessibility
Accessibility on Twitter is based on the MT values in Table 2 with IDs F3, RP1, RT1, FT1, and M2, which correspond to the following, replies, retweets, favorites/likes, and mentions of a user. A user with high following activity shows a preference for discussion with other users: because the user follows other users, the user can reply to, retweet, like, and mention them. Hence accessibility is obtained by adding the MT values that represent user accessibility, as formulated in Equation 3.
$$\mathit{accessibility} = F3 + RP1 + RT1 + FT1 + M2 \qquad (3)$$

Innovation
The innovation value on Twitter can be observed from the MT values OT1 and RT2, namely the tweets of a user and the tweets of the user that are retweeted by other users. In addition, innovation is observable from the sentiment of the replies given by other users, which is represented by the sentiment weight. The innovation value is thus obtained by adding the MT values that represent the innovation characteristic to the sentiment weight, as formulated in Equation 4.
$$\mathit{innovation} = OT1 + RT2 + BS_u \qquad (4)$$
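The three characteristic values can be computed per user from the MT values and the sentiment weight, as in the sketch below. The accessibility and innovation sums follow the MT IDs named in the text. The external communication formula, followers divided by followers plus following, is an assumption chosen to agree with the statement that a value close to 1 is best. The dictionary layout and the function name `characteristics` are hypothetical.

```python
from typing import Dict, Tuple


def characteristics(mt: Dict[str, float], bs_u: float) -> Tuple[float, float, float]:
    """Return (external communication, accessibility, innovation) for one user."""
    f1, f3 = mt["F1"], mt["F3"]  # followers, following
    # ASSUMPTION: follower-rank ratio; 0.0 when the user has no connections.
    external = f1 / (f1 + f3) if (f1 + f3) > 0 else 0.0
    # Sum of following, replies, retweets, favorites/likes, and mentions.
    accessibility = mt["F3"] + mt["RP1"] + mt["RT1"] + mt["FT1"] + mt["M2"]
    # Sum of own tweets, tweets retweeted by others, and the sentiment weight.
    innovation = mt["OT1"] + mt["RT2"] + bs_u
    return external, accessibility, innovation
```

The resulting triple is the three-attribute record on which the skyline and top-k queries operate.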

Implementation
The implementation compares a conventional approach with MapReduce. The conventional model is used as a baseline to investigate the efficiency and speedup of the MapReduce model. The conventional model runs on a single computer using an object-oriented programming model with the aid of the Google Guava library. The MapReduce model uses the Hadoop library on 3 computers (nodes). Figure 4 shows the execution time of the MapReduce model for different numbers of nodes and data sizes. The data sizes used are 100,000, 200,000, 300,000, 400,000, and more than 500,000 tweets. Adding nodes to the MapReduce model shortens the execution time, and the execution time tends to level off across data sizes once the data exceed 500,000 tweets. Adding nodes therefore gave consistent execution times, which suggests that the MapReduce model distributes the time complexity across the involved nodes. The output of this stage is the user data with the value of each characteristic of user influence.

User influence
The data obtained from the previous step consist of 3 attributes that represent the characteristic values of user influence. The skyline is calculated on these data using the SFS algorithm. The local skyline contains 186 users, which are the user influence of the individual parts, and the global skyline contains 16 users. The top-k query is then required to determine the most influential users. In Figure 5, the bigger red dots, or the dots on the area, are the user influence in the data. Figure 6 compares the execution time of the conventional model and MapReduce in selecting user influence with the skyline and top-k queries. The data sizes used are 100,000, 200,000, 300,000, 400,000, and more than 500,000 users. The k values used are 2, 4, 6, 8, and 10. Processing the skyline query with MapReduce shows a shorter and nearly constant execution time for every number of nodes and data size, whereas the conventional model shows an execution time that grows exponentially with the data size.
For the top-k algorithm with various k and numbers of nodes, the execution time is nearly constant and not significantly different between the conventional model and MapReduce, because the k values are small. Nevertheless, the MapReduce model always has a shorter execution time across the data sizes. Table 4 is an example of user influence with k=5. The user influence named "andreas_o_joeda" was selected. Applying the skyline and top-k queries in the MapReduce framework identifies the users that are able to influence other users in a shorter time. However, the selected users do not discuss the topics related to national figures that correspond to the issues in online media portals.

Conclusion
The identification process showed that the OL characteristics (external communication, accessibility, and innovation adaptation) are applicable for finding user influence in Twitter data. Based on these characteristics, the selection of user influence using MT values and tweet content analysis yielded 16 user influences out of 65,702 users. In addition, the identification and selection processes in the MapReduce model ran faster than in the conventional model.
For future work, the selection of user influence should be conducted on dynamic data, so that the selection process can track the current, real-time position of user influence with respect to current issues about national figures.