Feature Selection for Web Page Classification Using Swarm Optimization
Creators
Description
The web’s growing popularity has brought with it a huge
amount of information, which makes automated web page
classification systems essential for improving search engine
performance. Web pages have many features, such as HTML or XML
tags, hyperlinks, URLs and text content, that can be considered
during automated classification. Hyperlinks are known to improve
web page classification because they reflect the linkages between
web pages. The aim of this study is to reduce the number of features
used in order to improve the accuracy of web page classification. In
this paper, a novel feature selection method based on an improved
Particle Swarm Optimization (PSO) that uses the principle of evolution
is proposed. The selected features were tested on the WebKB dataset
using a parallel neural network to reduce the computational cost.
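
The description does not spell out the improved PSO, so the following is only a minimal sketch of PSO-based feature selection using a plain binary PSO (sigmoid-thresholded velocities, in the style of Kennedy and Eberhart's discrete binary variant). It is not the authors' method. The names `binary_pso` and `toy_fitness`, the parameter values, and the toy fitness function are all illustrative assumptions; in a setting like the one described, the fitness would more plausibly be the validation accuracy of a classifier trained on the selected features.

```python
import math
import random


def binary_pso(fitness, n_features, n_particles=20, n_iters=50,
               w=0.7, c1=1.5, c2=1.5, v_max=4.0, seed=0):
    """Search for a 0/1 feature mask that maximizes `fitness` (hypothetical helper)."""
    rng = random.Random(seed)
    # Each particle is a bit mask: 1 keeps the feature, 0 drops it.
    positions = [[rng.randint(0, 1) for _ in range(n_features)]
                 for _ in range(n_particles)]
    velocities = [[0.0] * n_features for _ in range(n_particles)]

    # Personal bests and the global best seen so far.
    pbest = [p[:] for p in positions]
    pbest_fit = [fitness(p) for p in positions]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(n_features):
                r1, r2 = rng.random(), rng.random()
                v = (w * velocities[i][d]
                     + c1 * r1 * (pbest[i][d] - positions[i][d])
                     + c2 * r2 * (gbest[d] - positions[i][d]))
                # Clamp the velocity so the sigmoid does not saturate.
                v = max(-v_max, min(v_max, v))
                velocities[i][d] = v
                # The sigmoid of the velocity is the probability of a 1 bit.
                prob = 1.0 / (1.0 + math.exp(-v))
                positions[i][d] = 1 if rng.random() < prob else 0

            fit = fitness(positions[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = positions[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = positions[i][:], fit

    return gbest, gbest_fit


# Assumed toy setup: features 0, 3 and 5 are "informative".
INFORMATIVE = {0, 3, 5}


def toy_fitness(mask):
    """Reward informative features and penalize the size of the subset."""
    hits = sum(1 for d, bit in enumerate(mask) if bit and d in INFORMATIVE)
    return hits - 0.1 * sum(mask)


if __name__ == "__main__":
    mask, score = binary_pso(toy_fitness, n_features=10)
    print("selected features:", [d for d, bit in enumerate(mask) if bit])
    print("fitness:", score)
```

The size penalty in `toy_fitness` stands in for the goal of using fewer features; swapping in a classifier-based score only requires passing a different `fitness` callable.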
Files
10000701.pdf (226.4 kB)
md5:662eccd5302a22a4f0e134736dcacbd7