Published July 29, 2021 | Version v1
Dataset Open

1998 World Cup Website Access Logs

  • 1. Hasso Plattner Institute

Description

The access logs and the accompanying description are taken directly from [1] and cover traffic to the 1998 World Cup website on three days. The log files are named "wc_dayX_Y.gz", where:

  • X is an integer that represents the day the access log was collected
  • Y is an integer that represents the subinterval for a particular day

This collection includes three log files, one for each of the following days:

wc_day25_1.gz  May 20, 1998  -> TR1
wc_day9_1.gz   May 4, 1998   -> TR2
wc_day28_1.gz  May 23, 1998  -> TR3
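
The naming scheme above can be handled programmatically. The following is a minimal sketch in C, not part of the original description. The function names parse_log_name and day_to_date are hypothetical, and the day-to-date rule (day X falls X days after April 25, 1998) is an assumption extrapolated from the three files listed above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Parse "wc_dayX_Y.gz" into day X and subinterval Y.
 * Returns 1 on success, 0 if the name does not match the pattern. */
int parse_log_name(const char *name, int *day, int *part)
{
    int consumed = 0;

    assert(name != NULL && day != NULL && part != NULL);
    if (sscanf(name, "wc_day%d_%d.gz%n", day, part, &consumed) != 2)
        return 0;
    /* %n is only stored once ".gz" has matched; nothing may follow it. */
    if (consumed == 0 || name[consumed] != '\0')
        return 0;
    return *day > 0 && *part > 0;
}

/* ASSUMPTION: day X is April 25, 1998 plus X days. This rule is inferred
 * from the three listed files only (day 25 -> May 20, day 9 -> May 4,
 * day 28 -> May 23). Returns 1 on success, 0 on failure. */
int day_to_date(int day, int *month, int *mday)
{
    struct tm t;

    assert(month != NULL && mday != NULL);
    memset(&t, 0, sizeof t);
    t.tm_year = 98;        /* years since 1900 */
    t.tm_mon = 3;          /* April (months are zero-based) */
    t.tm_mday = 25 + day;  /* out-of-range value, normalized by mktime */
    t.tm_hour = 12;        /* midday, so a DST shift cannot change the date */
    t.tm_isdst = -1;
    if (mktime(&t) == (time_t)-1)
        return 0;
    *month = t.tm_mon + 1;
    *mday = t.tm_mday;
    return 1;
}
```

Calling parse_log_name on "wc_day25_1.gz" yields day 25 and subinterval 1, which day_to_date then maps to May 20 under the stated assumption.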

Format

The access logs from the 1998 World Cup Web site were originally in the Common Log Format. To reduce both the size of the logs and the analysis time, they were converted to a binary format with big-endian (network) byte order. Each entry in the binary log has a fixed size of 20 bytes and represents a single request to the site. A request in the binary log has the following layout:

struct request
{
  uint32_t timestamp;
  uint32_t clientID;
  uint32_t objectID;
  uint32_t size;
  uint8_t method;
  uint8_t status;
  uint8_t type;
  uint8_t server;
};
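
Because the records are stored in big-endian order, reading the structure directly from disk gives wrong values on little-endian machines. The sketch below shows one way to decode a record byte by byte. It assumes that the file has already been decompressed (for example with gunzip) and that records are stored back to back as 20 packed bytes in the field order shown above. The names RECORD_SIZE, read_be32, decode_request and read_request are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define RECORD_SIZE 20  /* ASSUMPTION: 4 x uint32_t + 4 x uint8_t, no padding */

struct request {
    uint32_t timestamp;
    uint32_t clientID;
    uint32_t objectID;
    uint32_t size;
    uint8_t method;
    uint8_t status;
    uint8_t type;
    uint8_t server;
};

/* Assemble a 32-bit value from 4 big-endian bytes, independent of host order. */
static uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Decode one raw 20-byte record into a host-order struct. */
void decode_request(const unsigned char buf[RECORD_SIZE], struct request *r)
{
    assert(buf != NULL && r != NULL);
    r->timestamp = read_be32(buf);
    r->clientID = read_be32(buf + 4);
    r->objectID = read_be32(buf + 8);
    r->size = read_be32(buf + 12);
    r->method = buf[16];
    r->status = buf[17];
    r->type = buf[18];
    r->server = buf[19];
}

/* Read the next record from an already decompressed stream.
 * Returns 1 on success, 0 on end of file or a truncated record. */
int read_request(FILE *in, struct request *r)
{
    unsigned char buf[RECORD_SIZE];

    if (fread(buf, 1, RECORD_SIZE, in) != RECORD_SIZE)
        return 0;
    decode_request(buf, r);
    return 1;
}
```

A typical loop would call read_request on stdin until it returns 0, with the decompressed log piped in. Decoding byte by byte avoids any dependence on the host byte order or on compiler-specific struct padding.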

The fields of the request structure contain the following information:

timestamp - the time of the request, stored as the number of seconds since the Epoch. The timestamp has been converted to GMT to allow for portability. During the World Cup the local time was 2 hours ahead of GMT (+0200). To obtain the local time, add 2 hours (7200 seconds) to each timestamp.
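
The local-time adjustment can be sketched as follows. This is an illustrative example, not part of the original tools. The names LOCAL_OFFSET_SECONDS, to_local_tm and format_local_time are hypothetical. The sketch adds the fixed +0200 offset to the timestamp and then uses gmtime, so the result does not depend on the time zone of the machine running the analysis.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define LOCAL_OFFSET_SECONDS (2 * 60 * 60)  /* +0200, as stated in the description */

/* Convert a GMT timestamp from the log into broken-down local time
 * at the tournament site. Returns 1 on success, 0 on failure. */
int to_local_tm(uint32_t timestamp, struct tm *out)
{
    time_t shifted = (time_t)timestamp + LOCAL_OFFSET_SECONDS;
    struct tm *utc;

    assert(out != NULL);
    /* gmtime applies no further time zone rule, so only the shift above
     * is reflected in the result. */
    utc = gmtime(&shifted);
    if (utc == NULL)
        return 0;
    *out = *utc;
    return 1;
}

/* Format the local time as "YYYY-MM-DD HH:MM:SS +0200".
 * Returns 1 on success, 0 if conversion fails or the buffer is too small. */
int format_local_time(uint32_t timestamp, char *out, size_t out_size)
{
    struct tm local;

    assert(out != NULL);
    if (!to_local_tm(timestamp, &local))
        return 0;
    return strftime(out, out_size, "%Y-%m-%d %H:%M:%S +0200", &local) != 0;
}
```

For example, the timestamp 895622400 (midnight GMT on May 20, 1998) corresponds to 02:00:00 local time on the same day.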

clientID - a unique integer identifier for the client that issued the request (this may be a proxy). Due to privacy concerns, the mappings from clientID to IP address cannot be released. Each clientID maps to exactly one IP address, and the mappings are preserved across the entire data set: if IP address 0.0.0.0 mapped to clientID X on day Y, then any request in any of the data sets containing clientID X also came from IP address 0.0.0.0.

objectID - a unique integer identifier for the requested URL; these mappings are also 1-to-1 and are preserved across the entire data set

size - the number of bytes in the response

method - the method contained in the client's request (e.g., GET).

status - this field contains two pieces of information; the 2 highest order bits contain the HTTP version indicated in the client's request (e.g., HTTP/1.0); the remaining 6 bits indicate the response status code (e.g., 200 OK).

type - the type of file requested (e.g., HTML, IMAGE), generally based on the file extension (e.g., .html) or the presence of a parameter list (e.g., '?' indicates a DYNAMIC request). If the URL ends with '/', it is considered a DIRECTORY.

server - indicates which server handled the request. The upper 3 bits indicate the region where the server was located (e.g., SANTA CLARA, PLANO, HERNDON, PARIS); the remaining 5 bits indicate which server at that location handled the request. All 8 bits taken together identify a unique server.

Reference

[1] M. Arlitt and T. Jin, "1998 World Cup Web Site Access Logs", August 1998. 

Files

Files (38.9 MB)

Checksum                              Size
md5:8c757dadc0e4e4523c3d2ae61f3f5d79  17.7 MB
md5:8893539b3bf96c7eb0fa880049926d9e  7.2 MB
md5:c027da940978765a7007c9d098ccd4b2  14.0 MB