WebKB (4UNI)
Description
The 4 Universities (4UNI) dataset, also known as WebKB, contains web pages collected by the Carnegie Mellon University (CMU) text learning group from the computer science departments of four universities: Cornell (867 pages), Texas (827), Washington (1,205), and Wisconsin (1,263), plus 4,120 miscellaneous pages collected from other universities. There are 8,282 web pages in total, classified into 7 categories: "student", "faculty", "staff", "department", "course", "project", and "other".
http://www.cs.cmu.edu/afs/cs/project/theo-20/www/data
The files:
texts.txt: The document set as plain text, one document per line.
score.txt: The class of each document, in the same line order as texts.txt.
split_<k>.pkl: A pickled pandas DataFrame with the k-fold cross-validation partition.
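
The file layout above can be read with a few lines of Python. This is a minimal sketch, not the maintainers' loader: the function names `load_webkb` and `load_split` are hypothetical, UTF-8 encoding is an assumption (undecodable bytes are replaced rather than raising), and labels are kept as strings because the page does not say how classes are encoded. The column layout of the split DataFrame is not documented here, so the sketch only loads it for inspection.

```python
from pathlib import Path


def load_webkb(data_dir):
    """Return (texts, labels) read from texts.txt and score.txt in data_dir."""
    data_dir = Path(data_dir)
    # newline="\n" makes the file iterator split on "\n" only, so that
    # line i of texts.txt stays aligned with line i of score.txt.
    # Assumption: the files are UTF-8; bad bytes are replaced, not fatal.
    with open(data_dir / "texts.txt", encoding="utf-8",
              errors="replace", newline="\n") as fh:
        texts = [line.rstrip("\r\n") for line in fh]
    with open(data_dir / "score.txt", encoding="utf-8", newline="\n") as fh:
        labels = [line.strip() for line in fh]
    if len(texts) != len(labels):
        raise ValueError(
            f"{len(texts)} documents but {len(labels)} labels; "
            "the two files should have one line per document"
        )
    return texts, labels


def load_split(path):
    """Load a split_<k>.pkl file as a pandas DataFrame.

    Unpickling can execute arbitrary code, so only load files
    obtained from a source you trust.
    """
    import pandas as pd  # imported lazily; needed only for the splits
    return pd.read_pickle(path)
```

After loading, `load_split(...).head()` is a reasonable first step to see how the folds are actually stored before relying on any particular column name.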
Files (114.9 MB)

| MD5 | Size |
|---|---|
| 05a7f124bd15db14453e30ec19767da2 | 16.4 kB |
| 5cfb550505579038022637cbb41a0cd3 | 244.6 kB |
| e04eb97cea919cb9ebe0a7d92e6e9908 | 122.7 kB |
| 9d60cb028e413559772c61d10473faea | 12.1 MB |
| e6bc1ddb47d91fe2665564ddb4db3554 | 102.4 MB |
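
The MD5 checksums listed above can be used to confirm that a download is intact. The following is a small sketch using Python's standard `hashlib`; the helper names `md5_of` and `verify` are hypothetical, and which checksum belongs to which file has to be taken from the record itself.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at path."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in chunks so the larger files need not fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """Return True if the file's MD5 matches expected_md5 (case-insensitive)."""
    return md5_of(path) == expected_md5.strip().lower()
```

For example, `verify("score.txt", "<md5 from the table>")` returns `False` for a truncated or corrupted download. MD5 is adequate for detecting accidental corruption, not for defending against deliberate tampering.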