ISOLATING INFORMATIVE BLOCKS FROM LARGE WEB PAGES USING HTML TAG PRIORITY ASSIGNMENT BASED APPROACH

Rasel Kabir; Shaily Kabir; Shamiul Amin

doi:10.5281/zenodo.3592313

Published September 25, 2015 | Version v1

Journal article Open

1. Department of Computer Science & Engineering, University of Dhaka, Bangladesh

Searching the web for useful information is a popular activity, but web pages often carry large amounts of irrelevant content, or noise, that makes the useful information hard to extract. Search engines, crawlers and information agents often fail to separate relevant information from this noise, which underlines the importance of efficient search results. Some earlier works look for noisy data only at the edges of the web page, while others consider the whole page for noise detection. In this paper, we propose a simple priority-assignment based approach for differentiating the main content of a page from the noise. We first partition the whole page into a number of disjoint blocks using an HTML tag based technique. Next, we determine a priority level for each block from the priorities of its HTML tags, using an aggregate priority calculation. The resulting priority value of each block helps rank the overall search results in online searching. Blocks with higher priority are termed informative blocks and are stored in a database for future use, whereas blocks with lower priority are considered noisy blocks and are excluded from further search operations. Our experimental results show considerable improvement in noisy block elimination and in online page ranking with limited searching time, compared to other known approaches. Moreover, the accuracy obtained with our approach, evaluated by applying the Naive Bayes text classification method, is about 90 percent, which is high compared to the others.
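The pipeline described in the abstract (partition the page into disjoint blocks, score each block from the priorities of its tags, then separate informative from noisy blocks) might look roughly like the sketch below. It is only an illustration under stated assumptions: the abstract does not give the tag priority values, the aggregation formula, or the cut-off, so `TAG_PRIORITY`, `DEFAULT_PRIORITY`, the mean-based aggregate and `threshold` are all hypothetical. Treating each direct child of `<body>` as one block is also an assumption, and the sketch expects well-formed markup with explicit closing tags.

```python
from html.parser import HTMLParser

# Hypothetical priorities: the abstract does not list the paper's actual values.
TAG_PRIORITY = {"h1": 5, "h2": 4, "p": 3, "article": 3, "li": 1,
                "a": 0, "nav": 0, "script": 0}
DEFAULT_PRIORITY = 1  # assumed priority for tags missing from the table
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # tags with no closing tag


class BlockCollector(HTMLParser):
    """Collects each direct child element of <body> as one disjoint block."""

    def __init__(self):
        super().__init__()
        self.depth = None    # None while outside <body>
        self.current = None  # block currently being filled
        self.blocks = []     # finished blocks: {"tags": [...], "text": [...]}

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.depth = 0
            return
        if self.depth is None:
            return
        if self.depth == 0 and tag not in VOID_TAGS:
            self.current = {"tags": [], "text": []}  # a new block starts here
        if self.current is not None:
            self.current["tags"].append(tag)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth is None or tag in VOID_TAGS:
            return
        if tag == "body":
            self.depth = None
            return
        if self.depth == 0:
            return  # stray closing tag, ignore it
        self.depth -= 1
        if self.depth == 0 and self.current is not None:
            self.blocks.append(self.current)  # the block is complete
            self.current = None

    def handle_data(self, data):
        if self.current is not None and data.strip():
            self.current["text"].append(data.strip())


def split_blocks(html, threshold=2.0):
    """Returns (informative, noisy) lists of (score, text) pairs.

    The score is the mean tag priority of the block, an assumed stand-in
    for the paper's aggregate priority calculation.
    """
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    informative, noisy = [], []
    for block in collector.blocks:
        tags = block["tags"]
        score = sum(TAG_PRIORITY.get(t, DEFAULT_PRIORITY) for t in tags) / len(tags)
        target = informative if score >= threshold else noisy
        target.append((round(score, 2), " ".join(block["text"])))
    return informative, noisy


page = ("<html><body>"
        "<div><ul><li><a href='#'>Home</a></li><li><a href='#'>About</a></li></ul></div>"
        "<div><h1>Title</h1><p>Body text.</p></div>"
        "</body></html>")
informative, noisy = split_blocks(page)
print(informative)
print(noisy)
```

With these assumed values, the link-heavy navigation block falls below the threshold and the heading-plus-paragraph block stays above it. Only the informative list would then be stored for later searching, and the stored scores could serve as a ranking signal.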

Files

4315ecij05.pdf (691.6 kB), md5:c40871d9fba3f7862b3390632821fc16
