There is a newer version of the record available.

Published August 25, 2021 | Version v1
Dataset Open

WebKb (4UNI)

Authors/Creators

Description

4 Universities (4UNI), also known as the WebKB dataset, contains web pages collected by the Carnegie Mellon University (CMU) text learning group from the Computer Science departments of four universities: Cornell (867 pages), Texas (827), Washington (1,205), and Wisconsin (1,263), plus 4,120 miscellaneous pages collected from other universities. In total there are 8,282 web pages, classified into 7 categories: "student", "faculty", "staff", "department", "course", "project", and "other".

http://www.cs.cmu.edu/afs/cs/project/theo-20/www/data

The files:
texts.txt: Document set (text), one document per line.
score.txt: Class label of each document, one per line; line i corresponds to line i of texts.txt.
split_<k>.pkl: pandas DataFrame with the k-fold cross-validation partition.
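
The file layout described above can be read with a few lines of Python. The following is a minimal sketch, not the authors' loader: the function names `load_corpus` and `load_split` are hypothetical, UTF-8 encoding is an assumption, and the columns of the split DataFrame are not documented in this record, so the sketch only loads it for inspection.

```python
def _read_lines(path):
    # newline="\n" makes "\n" the only line terminator, so stray "\r" or
    # other separators inside a document do not shift the line alignment.
    # Assumption: the files are UTF-8; undecodable bytes are replaced.
    with open(path, encoding="utf-8", errors="replace", newline="\n") as f:
        return [line.rstrip("\n") for line in f]


def load_corpus(texts_path, labels_path):
    """Return (texts, labels), where labels[i] is the class of texts[i]."""
    texts = _read_lines(texts_path)
    labels = [line.strip() for line in _read_lines(labels_path)]
    if len(texts) != len(labels):
        raise ValueError(
            f"{len(texts)} documents but {len(labels)} labels; "
            "the two files are expected to align line by line"
        )
    return texts, labels


def load_split(split_path):
    """Load one split_<k>.pkl file as a pandas DataFrame."""
    # Imported here so load_corpus works even without pandas installed.
    import pandas as pd

    # Unpickling runs arbitrary code; only load files from a trusted source.
    return pd.read_pickle(split_path)


# Example usage (paths are placeholders for the downloaded files):
# texts, labels = load_corpus("texts.txt", "score.txt")
# folds = load_split("split_<k>.pkl")  # substitute the actual file name
# print(folds.head())                  # inspect the undocumented columns
```

The length check is there because the pairing between documents and classes is purely positional, so a mismatch in line counts would silently mislabel every document after the offset.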

Files (114.9 MB)

Checksum Size
md5:05a7f124bd15db14453e30ec19767da2 16.4 kB
md5:5cfb550505579038022637cbb41a0cd3 244.6 kB
md5:e04eb97cea919cb9ebe0a7d92e6e9908 122.7 kB
md5:9d60cb028e413559772c61d10473faea 12.1 MB
md5:e6bc1ddb47d91fe2665564ddb4db3554 102.4 MB
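
The md5 values listed above can be used to check a download for corruption. The sketch below uses Python's standard `hashlib`; the helper names `md5_of_file` and `matches_checksum` are hypothetical, and since the listing does not say which file each checksum belongs to, the pairing has to be taken from the record page itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Chunked reading keeps memory use flat for the larger files.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


# Example usage (the file name is a placeholder for a downloaded file):
# ok = matches_checksum("texts.txt", "md5:<value from the listing>")
```

md5 is adequate here for detecting accidental corruption during transfer; it is not a guarantee against deliberate tampering.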

Additional details