COMPARISON OF CLASSIFICATION ALGORITHMS TO DETECT PHISHING WEB PAGES USING FEATURE SELECTION AND EXTRACTION

Phishing is a kind of e-commerce lure that tries to steal a web user's confidential information by building a near-identical copy of a legitimate website, in which the content and images remain almost the same as on the legitimate site with only small changes. Another form of phishing makes minor changes to the URL or the domain of the legitimate website. In this paper, a number of anti-phishing toolbars are discussed and a system model is proposed to tackle phishing attacks. The proposed anti-phishing system is based on a plug-in tool for the web browser. The performance of the proposed system is studied with three data mining classification algorithms: Random Forest, Nearest Neighbour Classification (NNC) and Bayesian Classifier (BC). To evaluate the proposed anti-phishing system, 7690 legitimate websites and 2280 phishing websites were collected from authorised sources such as the APWG database and PhishTank. The analysis of the data mining algorithms on phishing web pages shows that the Bayesian algorithm responds faster and gives more accurate results than the other algorithms.


INTRODUCTION
A number of governmental and private authorised agencies work on phishing and on countermeasures against phishing attacks. The APWG (Anti-Phishing Working Group) and PhishTank are two of the most prominent organisations that keep records of phishing and legitimate websites. According to APWG records, the total number of unique phishing websites detected from quarter 1 to quarter 3 of 2015 was 630,494 [1].
Although the report does not mention the economic loss, the scale of the loss can be imagined when thousands of websites are declared as phishing every month worldwide. According to a report by Javelin Strategy and Research in April 2012, the economic loss reached 21 billion [4]. Phishing is a serious challenge and undermines trust in electronic commerce and in the security of e-services. Seeing the effect of weak security on online transactions, many people are giving up e-transaction facilities. People hesitate to use otherwise convenient online services because they cannot be sure whether their credentials are in danger. This raises the questions of how to identify fraud and how to design and build a reliable and secure environment for electronic business transactions. Research is therefore necessary to reduce the problems of online transactions.
To solve the problem of phishing, researchers are looking for solutions on both the client side and the server side. So far, only slow progress has been observed in client-side and server-side design and testing. On the client side, around 110 types of user-centred applications have been developed. These applications take the form of web browser toolbars and additional plug-ins installed alongside the web browser. Strong server-side system design is found to be the more important requirement for protecting users from phishing attacks. During the study, it was seen that server-side applications are not yet giving successful results; server-side security concepts are being proposed, while the working applications run on the client side [5,6].

METHODS OF PHISHING ATTACK
An attacker can target a website in different ways. Some of the methods are as follows [2]:
Link manipulation: Several methods of phishing use some kind of technical deception designed to make a link in an e-mail appear to belong to the spoofed organization. Phishers misspell URLs or use sub-domains to target the user. For example, the URL http://www.mybank.services.com/ appears to ask the user to log in to the 'mybank.services' section of a web page, whereas it is in fact a phishing URL.
Filter evasion: Here the phisher uses images instead of text to make it harder for anti-phishing filters to detect the text commonly used in phishing e-mails. This type of phishing takes less time to prepare the spoofed website and needs very few lines of code to build the web page.
Website forgery: An attacker can even use flaws in a trusted website's own scripts against the victim. Attacks of this type (known as cross-site scripting) are particularly problematic because they direct the user to sign in at the bank or services section of the web page, where everything from the web address to the security certificates appears correct.
Phone phishing: Since the use of mobile phones and mobile internet access is increasing rapidly, not all phishing attacks require a fake website. Messages arrive on the mobile phone claiming to be from a bank and ask the user to dial a phone number regarding problems with their bank account information.
Tabnabbing: Tabnabbing is another kind of phishing attack, which directs users to submit their login information and passwords to popular websites by impersonating those sites and convincing the user that the site is genuine [45].
DNS-Based Phishing ("Pharming"): Pharming is the term given to hosts file modification. This type of phishing is also called DNS-based phishing. The phisher tampers with a company's hosts files or the DNS so that requests for URLs or name services return a bogus address and subsequent communications are directed to a fraudulent site. The targeted users are not aware that the website in which they are entering their confidential information is controlled by the phisher and is probably not even in the same country as the legitimate website [7].

OVERVIEW OF PREVIOUS STUDY ON PHISHING
The work in [8] proposed a novel phishing classification system based on URL features. This approach is quite successful, but the heuristic classification system might not be efficient on HTTP clients because of the delay caused by HTTP search queries. The author therefore suggested implementing the system on a mail server, where e-mail contents are checked passively without imposing a delay on client-side applications. Wardman B. et al. [9] presented a high-performance content-based phishing attack detection approach, in which a set of file matching algorithms based on website content is implemented to detect phishing. The evaluation employs a custom data set that contains 17,992 phishing attacks targeting 159 different company brands. The results of Wardman's experiments with a variety of content-based approaches demonstrate that some of them can achieve a detection rate of more than 90% while maintaining a low false positive rate.
Weider D. Yu et al. [10] presented a phishing detection tool, PhishCatch, in which a novel anti-phishing algorithm is developed to protect the user from phishing attacks. The algorithm is based on heuristics that detect phishing e-mails and alert the user about them. The phishing filters and rules used in the algorithm were formulated after extensive research into phishing methodologies and tactics, as presented in the paper. After testing, the authors determined that the algorithm has a catch rate of 80% and an accuracy of 99%. Prakash P. et al. [11] presented "PhishNet", in which five heuristics are used to enumerate simple combinations of known phishing sites in order to discover new phishing URLs. In an evaluation with real-time blacklist feeds, the system discovered around 18,000 new phishing URLs from a set of 6,000 new blacklist entries. They showed that the approximate matching algorithm leads to very few false positives (3%) and false negatives (5%). Isredza Rahmi A Hamid et al. [12] suggested profiling phishing e-mail based on a clustering approach, in which a method for profiling e-mail-borne phishing activities is proposed. Profiling phishing activities is useful in determining the activity of an individual or a particular group of phishers. By generating profiles, phishing activities can be better understood and observed. Their proposed profiling e-mail-borne phishing algorithm (ProEP) demonstrates promising results with the Ratio Size rules for selecting the optimal number of clusters. Zhang H. et al. [13] presented a framework based on the Bayesian approach for content-based phishing web page detection. The effectiveness of the system is examined on a large-scale data set collected from real phishing cases from trusted sources. Zhang's experimental results showed that the text and image classifiers deliver promising results. They use a fusion algorithm that outperforms the individual classifiers. Their model can be adapted for further study on phishing.
Li T. et al. [14] proposed an offline phishing detection system named Large-scale Antiphishing by Retrospective data-eXploration (LARX). This system uses network traffic data archived at a vantage point and analyzes the data for phishing detection. The phishing filter in the system runs on a cloud computing platform. Since the system detects phishing offline, LARX can be effective for the analysis of large volumes of trace data when enough computing power and storage capacity are available. Huang H. et al. [15] gave a thorough overview of deceptive phishing attacks and their countermeasure techniques. Their study discusses the technologies used by phishers together with the definitions, classification and future work on deceptive phishing attacks. Edward Ferguson et al. [16] presented Cloud Based Content Fetching: Using Cloud Infrastructure to Obfuscate Phishing Scam Analysis, in which the proposed system presents different personas and user behaviour to phishing sites by using different IP addresses and different browsing configurations. By running a 10-day probe experiment against real phishing sites, they showed the effectiveness of this approach in preventing phishing site operators from detecting and blocking anti-phishing probes. The paper is based on emerging phishing techniques [17,18].
Mahmood Ali M. et al. [19] presented a paper on 'Deceptive Phishing Detection System (From Audio and Text messages in Instant Messengers using Data Mining Approach)', in which words are recognized from speech with the help of FFT spectrum analysis and LPC coefficients.

ANTI-PHISHING TOOLBARS
A number of anti-phishing approaches proposed in earlier studies can be used to identify whether a web page is a phishing page or not. We examined these tools to get a basic understanding of how each one functions. The earlier tools try to protect the user's confidential information, but they are not completely successful. Legitimate sites are listed in whitelists, which contain known safe sites, and fraudulent sites are listed in blacklists. The various anti-phishing tools are described below [20].
CallingID focuses on site ownership details and real-time rating, and confirms to the user that the site is safe to provide information to. The CallingID toolbar runs 54 different verification tests to determine the legitimacy of a given site. The toolbar uses visual indicators of different colours to show the type of website: green indicates a known-good site, yellow indicates a site that is 'at low risk', and red indicates a site that is 'at high risk' and may therefore be a phishing site. Some of the heuristics used include examining the site's country of origin, length of registration, user reports, popularity of the website and blacklist data [21].
The Cloudmark Anti-Fraud toolbar is based on user ratings [22]. When users visit a website, they can report whether the site should be trusted or not. On the basis of these reports, the toolbar displays a coloured icon for each site visited by the user. The users themselves are rated according to their record of correctly identifying phishing sites. Each site's rating is computed by aggregating all ratings given for that site, with each user's rating of a site weighted according to that user's reputation.
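The reputation-weighted aggregation described for the Cloudmark toolbar can be illustrated with a short sketch. This is not Cloudmark's actual formula, which the source does not give; it assumes a simple weighted average in which each vote is 1.0 (trustworthy) or 0.0 (fraudulent) and each reputation is a non-negative weight. The function name `weighted_site_rating` and the sample values are hypothetical.

```python
def weighted_site_rating(ratings):
    """Aggregate user votes for one site, weighting each vote by user reputation.

    ratings: list of (vote, reputation) pairs, where vote is 1.0 for
    "trustworthy" and 0.0 for "fraudulent" (an assumed encoding), and
    reputation is a non-negative weight.
    Returns a value in [0, 1], or None if there is no usable rating.
    """
    total_weight = sum(reputation for _, reputation in ratings)
    if total_weight == 0:
        return None
    return sum(vote * reputation for vote, reputation in ratings) / total_weight


# Two reliable users trust the site, one low-reputation user does not.
example_ratings = [(1.0, 0.9), (0.0, 0.2), (1.0, 0.4)]
score = weighted_site_rating(example_ratings)
```

With a weighting of this kind, a report from a user with a poor track record moves the site's rating much less than a report from a user who has identified phishing sites correctly in the past.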
The EarthLink toolbar appears to rely on a combination of heuristics, user ratings and manual verification [23]. The toolbar allows users to report suspected phishing sites to EarthLink. These sites are then verified and added to a blacklist. The toolbar also appears to examine domain registration information such as the owner, age and country.
The eBay tool uses a combination of heuristics and blacklists [24]. The Account Guard indicator has three modes: green, red and grey. The icon is displayed with a green background when the user visits a site known to be operated by eBay (or PayPal), a red background when the site is a known phishing site, and a grey background when the site is not operated by eBay and not known to be a phishing site. Known phishing sites are blocked and a pop-up appears, giving users the option to override the block. The toolbar also gives users the ability to report phishing sites.
Firefox includes a new feature designed to identify fraudulent websites. Originally, this functionality was an optional extension for Firefox as part of the Google Safe Browsing toolbar. URLs are checked against a blacklist, which Firefox downloads periodically [25]. The feature displays a pop-up if it suspects the visited site to be fraudulent and gives users the choice of leaving the site or ignoring the warning. Optionally, the feature can send every URL to Google to determine the likelihood of it being a scam. According to the Google toolbar download site, the toolbar combines "advanced algorithms with reports about misleading pages from a number of sources [26]."
The Netcraft Anti-Phishing Toolbar uses several methods to determine the legitimacy of a website [27]. The Netcraft website explains that the toolbar traps suspicious URLs containing characters that have no common purpose other than to deceive the user; enforces the display of browser navigation controls (tool and address bar) in all windows, to defend against pop-up windows that can hide the navigational controls; and 'clearly displays sites', showing the hosting location, including the country, to help evaluate fraudulent URLs.
The Netscape Navigator 8.1 web browser includes a built-in phishing filter [28]. From testing of this tool as well as third-party reviews, it appears that this functionality relies solely on a blacklist, which is maintained by AOL and updated frequently. When a suspected phishing site is encountered, the user is redirected to a built-in warning page. Users are shown the original URL and are asked whether or not they would like to proceed.
SpoofGuard is a tool that helps to prevent a form of malicious attack called "web spoofing" or "phishing" [29]. Phishing attacks usually involve a deceptive e-mail that appears to come from a popular commercial site. The e-mail explains that the recipient has an account problem, or gives some other reason to visit the commercial site and log in. However, the link in the e-mail sends the user to a malicious "spoof" site that collects the user's information, such as account names, passwords and credit card numbers. Once the user information has been collected by a "spoof" site, criminals may log into the user's account or cause other damage.

AN APPROACH FOR THE EXPECTED OUTCOMES
Prior knowledge of phishing is very important for protecting the user from phishing attacks. When a user accesses a website, a message should appear in the web browser window showing whether the website is suspicious, phishing or legitimate. In this way, the user is informed about the type of website and can decide whether or not to use it. The proposed add-on informs the user instantly when the user hits the web address. Using this add-on, the user can learn the difference between legitimate and phishing websites. The following points should be kept in mind when an anti-phishing tool is designed and prepared.
1) When preparing a web-browser-based add-on anti-phishing tool, the phishing keywords should be divided among different assigned servers to achieve fast and accurate results. A study of how existing anti-phishing tools function shows that they do not give timely and accurate results.
2) When a new phishing website is activated, various anti-phishing tools do not give a proper message to the user or do not identify the website. In this situation, newly arriving websites should be stored in the anti-phishing tool's database when the user hits them. The website should be analysed at the time of access and the result should be displayed instantly.
3) Some anti-phishing tools do not support all web browsers properly. In this case, the anti-phishing tool does not give a satisfactory result. An anti-phishing add-on tool should be compatible with all web browsers. In the proposed add-on design, an executable file is prepared that supports all web browsers.
Previous results show that anti-phishing tools respond late to the web browser. In this situation, users enter their confidential information into the suspicious website and only later become aware of the type of website. Informing the web user about the website category in time is a very difficult task. The anti-phishing tool must therefore be fast enough, which is only possible when its program code is concise and easy to execute.

PHISHING FEATURES
In previous studies, researchers have suggested a number of anti-phishing system models to find a solution to phishing [30][31][32][33][34][35][36]. These system models do not show more than 85 percent successful results [37][38][39][40][41]. In some cases

Adding a prefix or suffix
4 Using @ symbol in web address
5 Using the hexadecimal character codes

Social human factors
1 Emphasis on security
2 Public generation salutation
3 Buying time to access accounts

CRITERIA OF URL, CONTENT AND IMAGE MATCHING
When a user wishes to access a web page, either a URL is entered in the web browser or the user reaches the target web page directly through a link on another website. In this case, the URL should be checked first, and then the contents and existing images should be checked [48]. Checking the various points of the website takes considerable time, because the website information has to be cross-checked against the database source of the add-on. In earlier studies, browser-based client-side solutions have been proposed to mitigate phishing attacks [49,50]. Some techniques have also been developed that attempt to prevent phishing mails from being delivered [51,52]. A system is therefore needed that can check the information entered by the user against the database information quickly and accurately. The phishing features have been selected from previous studies [53][54][55] and categorized according to their nature.
On the basis of the different case conditions of a possible phishing web page, the phishing features are assigned to different groups with different case conditions. Table 1 shows the evaluation criteria for finding phishing, in which the phishing criteria are defined at different assigned servers, namely S1, S2, S3, S4 and S5. Apart from this selection of phishing features, the domain age of the website can also be found from www.domaintools.com. This website provides information about a website, such as when it was created and how long it has existed. Some governmental authorities are also working on phishing detection to achieve a better solution for protecting the user from electronic fraud. These authorities have already declared many websites as phishing, so in this study the database source is enlarged with the help of these authorised sites. The phishing features are also studied against the previously defined anti-phishing system models [56].
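A few of the URL-based features named in this paper (the @ symbol in the web address, hexadecimal character codes, and an added prefix or suffix) can be extracted with a short sketch. This is only an illustration, not the add-on's actual implementation: the function name `url_features`, the feature names and the reading of "prefix or suffix" as a hyphen in the host name are all assumptions.

```python
import re
from urllib.parse import urlsplit

# A percent sign followed by two hexadecimal digits, e.g. "%41".
HEX_ESCAPE = re.compile(r"%[0-9a-fA-F]{2}")


def url_features(url):
    """Return three illustrative URL-based phishing indicators as booleans."""
    host = urlsplit(url).hostname or ""
    return {
        # '@' in a web address can hide the real destination host.
        "has_at_symbol": "@" in url,
        # Hexadecimal character codes can disguise parts of the address.
        "has_hex_codes": bool(HEX_ESCAPE.search(url)),
        # Assumption: a prefix or suffix is attached to the host with '-'.
        "has_prefix_suffix": "-" in host,
    }


features = url_features("http://user@example.com/%41")
```

Each indicator could then be mapped to one attribute of the data set; how the attributes are distributed over the servers S1 to S5 is defined by Table 1 and is not reproduced here.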

EXPERIMENTAL ANALYSIS
The phishing features have been defined at five different assigned servers, so that when a user hits a website, the server concerned can send its reply to the add-on system. In each rule-based condition, every component is assumed to take one of three levels (Low, Medium or High), and the combination of levels gives the result of whether the accessed website is phishing or not. A hit was performed on the web URL 'www.login.yahoo.com' and the results obtained are given in Table 2. In the table, S-1, S-2, S-3, S-4 and S-5 are the classification systems assigned for the anti-phishing tool. At the end of the table, the result obtained from all the systems is given as a percentage. A website is declared as phishing if the percentage is higher than 60; if the percentage is between 40 and 60, the website is declared highly risky; and if it is below 10 percent, the website address is stored in the anti-phishing tool's database for further checking. After the 10 percent suspicious conditions have been checked, the website is declared as phishing or legitimate, to alert the user on further access to the same website. On the basis of the above outcomes, the rules are defined according to the features present on the web page. Table 3 shows the risk status of the accessed website 'www.login.yahoo.com', where L = Legitimate, HR = Highly Risky, R = Risky and P = Phishing.
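The percentage thresholds above can be written as a small decision function. This is a sketch under stated assumptions, not the tool's actual code: the text does not say how scores from 10 up to 40 are treated, so the "risky" band below is an assumption taken from the R label in the legend, and the handling of the exact boundary values 40 and 60 is also assumed. The function name `risk_status` is hypothetical.

```python
def risk_status(percentage):
    """Map the combined percentage from the servers to a risk status."""
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")
    if percentage > 60:
        return "phishing"
    if percentage >= 40:
        return "highly risky"
    if percentage >= 10:
        # Assumption: the source leaves this range unspecified.
        return "risky"
    # Below 10 percent the address is stored for further checking.
    return "store for further checking"


statuses = [risk_status(p) for p in (75, 50, 25, 5)]
```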

PERFORMANCE ANALYSIS OF THE PROPOSED SYSTEM
A number of anti-phishing tools have been proposed in earlier studies to protect the user from phishing attacks. The previous studies are based on the functioning of anti-phishing tools together with a number of data mining techniques, which are analysed to solve the phishing problem [57][58][59].
Earlier studies show that the performance of classification techniques is affected by the type of data sets used and by the way in which the classification algorithms are implemented in the toolkit. The WEKA (Waikato Environment for Knowledge Analysis) data mining toolkit shows better performance than other data mining tools [60]. WEKA is open-source Java software designed for data mining problems; it includes implementations of different methods for several types of data mining tasks such as clustering, classification, association rules and regression analysis. Here, three data mining algorithms are discussed under WEKA version 3.6.
The data format that WEKA supports is ARFF (Attribute-Relation File Format), and such a file was prepared for the website www.login.yahoo.com. In WEKA, the attributes are defined first; 19 attributes (based on the phishing features) are used in the study, followed by a final attribute for the result. The analyser expresses the result only as 0, 1 or -1. In this study of phishing, 1 is assigned to a 'positive' result, 0 denotes 'no relation' and -1 denotes a 'negative' result. The performance of the algorithms can be measured with the classification accuracy metric. The accuracy on a data set is calculated as the percentage of correctly classified websites in the given data set.
In the proposed system, the system tool sends the website information to 5 different assigned servers to check the status/category of the websites. These servers are categorised by different
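The classification accuracy metric described above, the percentage of correctly classified websites, can be computed as in the following sketch. The function name `accuracy_percent` and the sample labels are hypothetical; the labels use the 1 / -1 coding for positive and negative results.

```python
def accuracy_percent(actual, predicted):
    """Percentage of websites whose predicted class equals the actual class."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return 100.0 * correct / len(actual)


# Five websites, four of them classified correctly.
actual_labels = [1, 1, -1, -1, 1]
predicted_labels = [1, -1, -1, -1, 1]
accuracy = accuracy_percent(actual_labels, predicted_labels)
```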

ALGORITHMS FOR FEATURE SELECTION
The performance of the proposed system is tested with three data mining classification algorithms: Random Forest (RF), Nearest Neighbour Classification (NNC) and Bayesian Classifier (BC). Since these algorithms work differently and together cover almost all areas of data mining problems, studying them gives a better picture of the performance of the anti-phishing tool. A brief description of these algorithms follows.
1) Random Forest: It is one of the best algorithms for classification problems and is able to classify large data sets accurately. The algorithm is a combination of tree predictors in which each tree depends on the values of a random vector sampled independently. The basic concept of this algorithm is that a group of "weak learners" can come together to form a "strong learner".
2) Nearest Neighbour Classification (NNC): It is a data mining algorithm that stores all available cases of the problem and classifies new cases based on a similarity measure. The number of nearest neighbours consulted is a numeric value called K.
3) Bayesian Classifier (BC): It is a well-known algorithm for studying phishing.
To apply the Bayesian filter to find phishing websites, two data sets are required: legitimate website details and phishing website information. A large data set of legitimate transactional websites is needed, because these websites closely resemble phishing websites and the filter must have numerous examples of legitimate transactional websites to achieve a low false positive rate.
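How a Bayesian classifier learns from the two data sets can be shown with a minimal sketch. It assumes a naive Bayes model with add-one (Laplace) smoothing over attributes coded as -1, 0 or 1; the source does not state which Bayesian classifier in WEKA was used, so this is not a reproduction of the experiment. The function names, the three-attribute rows and the class labels are hypothetical.

```python
import math
from collections import Counter, defaultdict

VALUES = (-1, 0, 1)  # attribute coding used in the study


def train(rows, labels):
    """Count classes and, per class, how often each attribute takes each value."""
    class_counts = Counter(labels)
    value_counts = {label: defaultdict(Counter) for label in class_counts}
    for row, label in zip(rows, labels):
        for index, value in enumerate(row):
            value_counts[label][index][value] += 1
    return class_counts, value_counts


def predict(model, row):
    """Return the class with the highest log posterior for one website."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for index, value in enumerate(row):
            # Add-one smoothing avoids zero probabilities for unseen values.
            seen = value_counts[label][index][value]
            score += math.log((seen + 1) / (count + len(VALUES)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


rows = [(1, 1, 0), (1, 0, 1), (1, 1, 1), (-1, -1, 0), (-1, 0, -1), (-1, -1, -1)]
labels = ["phishing"] * 3 + ["legitimate"] * 3
model = train(rows, labels)
```

Calling `predict(model, (1, 1, 1))` on this toy data selects the phishing class, because every attribute value of that row was seen more often in the phishing examples. The need for many legitimate examples shows up directly in the counts: with few legitimate rows, the smoothed probabilities for the legitimate class stay close to uniform and false positives become more likely.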

RESULTS AND DISCUSSION
To study the performance of the above-mentioned data mining algorithms, consecutive hits were made in the web browser on a number of legitimate and phishing websites collected from different authentic sources. After a website is hit, the add-on system sends the request to the assigned servers. The assigned servers cross-check the website details against the database source and send their response to the main server. All the assigned servers keep a record of the websites hit. Figure 1 shows a snapshot of the WEKA Explorer in which all the phishing features are included in the study. The figure shows the pre-process configuration of the classification algorithm filter, with 20 attributes and 10 instances. The algorithms were tested with 2-, 5- and 10-fold cross-validation and with 75 and 66 percent splits of the data. The 10-fold cross-validation test option shows better performance than the percentage split cases. When the training data size is small, the system tool functions well. For larger data sets, the result decreases slightly. Using a pruning method in a classification algorithm gives higher accuracy and better performance than mining the data without pruning. Testing a large data set requires a large decision tree to be built, which results in a longer computation time. Tables 4 and 5 show the phishing training data set tested with the WEKA software under the 75 and 66 percent split test conditions. The performance of the Random Forest and Nearest Neighbour algorithms was almost the same on all kinds of data sets, whereas the Bayesian algorithm is slightly better under different case conditions. In almost all conditions, the cross-validation test method performs better.
A data set with more than 500 instances is treated as a large data set, and the algorithms give better results on large data sets. In the study of Random Forest on a large data set, it is found that it builds the largest trees, which causes the lowest overall performance. Of the three algorithms, two use the reduced-error pruning method, which builds trees of approximately equal size that are quite large. The Bayesian algorithm builds the smallest trees. This indicates that cost-complexity pruning reduces the trees to a smaller size than reduced-error pruning. The Bayesian algorithm gives better results on data sets with many numerical attributes. It is also observed that all three algorithms perform better on data sets with few numerical attributes.
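The two test options used above, percentage split and k-fold cross-validation, differ only in how the instances are divided into training and test sets. The following sketch illustrates both on instance indices. It is not WEKA's implementation, and details such as shuffling with a fixed seed and rounding the split point are assumptions; the function names are hypothetical.

```python
import random


def percentage_split(n_instances, train_percent, seed=0):
    """Shuffle the indices and use the first train_percent percent for training."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    cut = round(n_instances * train_percent / 100)
    return indices[:cut], indices[cut:]


def k_fold(n_instances, k, seed=0):
    """Yield (train, test) index lists; each instance is tested exactly once."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


train_75, test_75 = percentage_split(100, 75)
ten_folds = list(k_fold(100, 10))
```

In a percentage split the accuracy is estimated from a single test set, whereas k-fold cross-validation averages over k test sets that together cover every instance, which is one common reason for cross-validation estimates being more stable.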

CONCLUSION
In this paper, three data mining algorithms have been discussed for the analysis of anti-phishing website data sets. These algorithms are Random Forest (RF), Nearest Neighbour Classification (NNC) and Bayesian Classifier (BC). Random Forest shows a success rate of around 68 percent when 75 percent of the data is used for training. If the database is already available for testing, the algorithm gives better results, but it is not well suited to on-the-spot checking of website hits. The Nearest Neighbour Classification technique gives better and more accurate results when there are fewer conditions to check. The Bayesian Classifier shows an accuracy of around 88 percent in finding phishing websites.
Comparing all these algorithms, Bayesian classification is more accurate and gives the system a faster response.

ACKNOWLEDGMENT
We would like to acknowledge the Anti-Phishing Working Group for providing the database of phishing and anti-phishing websites, and our colleagues who encouraged us to find more data on this topic. We express our sincere gratitude to the research committee for its guidance on data collection, valuable suggestions and encouragement throughout the work.