Published January 4, 2021 | Version v1
Dataset Open

Data set of the article: Using Machine Learning for Web Page Classification in Search Engine Optimization

  • 1. Faculty of Economics and Tourism, University of Pula, 52100 Pula, Croatia
  • 2. Faculty of Organization and Informatics Varaždin, University of Zagreb, 10000 Zagreb, Croatia
  • 3. Jožef Stefan Institute, 1000 Ljubljana, Slovenia

Description

Data from the study published in the article "Using Machine Learning for Web Page Classification in Search Engine Optimization".

Abstract of the article:

This paper presents a novel approach of using machine learning algorithms based on experts’ knowledge to classify web pages into three predefined classes according to the degree of content adjustment to the search engine optimization (SEO) recommendations. In this study, classifiers were built and trained to classify an unknown sample (web page) into one of the three predefined classes and to identify important factors that affect the degree of page adjustment. The data in the training set are manually labeled by domain experts. The experimental results show that machine learning can be used for predicting the degree of adjustment of web pages to the SEO recommendations—classifier accuracy ranges from 54.59% to 69.67%, which is higher than the baseline accuracy of classification of samples in the majority class (48.83%). Practical significance of the proposed approach is in providing the core for building software agents and expert systems to automatically detect web pages, or parts of web pages, that need improvement to comply with the SEO guidelines and, therefore, potentially gain higher rankings by search engines. Also, the results of this study contribute to the field of detecting optimal values of ranking factors that search engines use to rank web pages. Experiments in this paper suggest that important factors to be taken into consideration when preparing a web page are page title, meta description, H1 tag (heading), and body text—which is aligned with the findings of previous research. Another result of this research is a new data set of manually labeled web pages that can be used in further research. 
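The abstract compares classifier accuracy against the accuracy obtained by always predicting the majority class. As a minimal sketch of how such a baseline can be computed, the following uses a small list of made-up labels. The function name, the class names, and the toy labels are assumptions for illustration only; they are not taken from data.txt, whose column layout is not described here.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most frequent class."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    # most_common(1) returns a list holding one (label, count) pair
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)


# Hypothetical toy labels (NOT from data.txt); the three class names are assumed.
toy_labels = ["low", "medium", "medium", "high", "medium", "low"]

if __name__ == "__main__":
    print(f"Majority-class baseline: {majority_baseline_accuracy(toy_labels):.2%}")
```

A trained classifier is only informative if its accuracy exceeds this value on the same labels, which is the comparison the abstract reports.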

Files

data.txt (58.5 kB)
md5:a9fd5a44430a1edc8bec8a4d737bf8c0
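
The MD5 checksum listed for data.txt can be used to confirm that a downloaded copy is intact. The following is a minimal sketch using Python's standard hashlib module. The helper names and the assumption that the file was saved as data.txt in the current directory are illustrative, not part of the record.

```python
import hashlib
from pathlib import Path

# Checksum as listed on this record for data.txt
EXPECTED_MD5 = "a9fd5a44430a1edc8bec8a4d737bf8c0"


def md5_of_file(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """True if the file's MD5 digest equals the expected checksum."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    local_copy = Path("data.txt")  # assumed download location
    if local_copy.exists():
        print("checksum OK" if matches_expected(local_copy) else "checksum MISMATCH")
    else:
        print("data.txt not found in the current directory")
```

MD5 is suitable here for detecting accidental corruption during download; it is not a guarantee against deliberate tampering.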