1998 World Cup Website Access Logs
Description
The access logs and the accompanying description are taken directly from [1]. They cover traffic to the 1998 World Cup website on three days. The log files are named "wc_dayX_Y.gz",
where:
- X is an integer that represents the day the access log was collected
- Y is an integer that represents the subinterval for a particular day
This collection includes three log files containing the access traffic on three different days as listed below:
wc_day25_1.gz May 20, 1998 -> TR1
wc_day9_1.gz May 4, 1998 -> TR2
wc_day28_1.gz May 23, 1998 -> TR3
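A file name in this scheme can be split into its day (X) and subinterval (Y) numbers with a short helper. The following is a minimal sketch only: parse_log_name is a hypothetical name that is not part of the data set, and it assumes that X and Y are positive decimal integers.

```c
#include <assert.h>
#include <stdio.h>

/* Parses a log file name of the form "wc_dayX_Y.gz".
 * Returns 1 and fills *day and *part on success, 0 otherwise.
 * parse_log_name is a hypothetical helper used for illustration. */
int parse_log_name(const char *name, int *day, int *part)
{
    int consumed = 0;

    assert(name != NULL && day != NULL && part != NULL);

    /* %n stores the number of characters consumed so far. It is only
     * reached when the literal ".gz" matched, so consumed stays 0 if
     * the suffix is missing. */
    if (sscanf(name, "wc_day%d_%d.gz%n", day, part, &consumed) != 2)
        return 0;

    /* Reject a missing suffix, trailing characters, and non-positive numbers. */
    return consumed > 0 && name[consumed] == '\0' && *day > 0 && *part > 0;
}
```

For example, calling parse_log_name on "wc_day25_1.gz" yields day 25 and subinterval 1, while a name with extra characters after ".gz" is rejected.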
Format
The access logs from the 1998 World Cup Web site were originally in the Common Log Format. To reduce both the size of the logs and the analysis time, the access logs were converted to a binary format (big-endian, i.e. network byte order). Each entry in the binary log has a fixed size and represents a single request to the site. The format of a request in the binary log looks like:
struct request {
    uint32_t timestamp;
    uint32_t clientID;
    uint32_t objectID;
    uint32_t size;
    uint8_t method;
    uint8_t status;
    uint8_t type;
    uint8_t server;
};
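Because the records are stored in big-endian order, reading them straight into the struct with fread would give wrong values on a little-endian host. The sketch below decodes one record byte by byte instead. It assumes the file has already been decompressed (the .gz suffix suggests gzip) and that a record occupies exactly 20 bytes with no padding, which follows from the field sizes above. The names RECORD_SIZE, read_be32, decode_request and read_request are illustrative and do not come from [1].

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define RECORD_SIZE 20 /* 4 x uint32_t + 4 x uint8_t, assuming no padding */

struct request {
    uint32_t timestamp;
    uint32_t clientID;
    uint32_t objectID;
    uint32_t size;
    uint8_t  method;
    uint8_t  status;
    uint8_t  type;
    uint8_t  server;
};

/* Assembles a big-endian 32-bit value, independent of host byte order. */
uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Decodes one raw 20-byte record into a struct request. */
void decode_request(const unsigned char buf[RECORD_SIZE], struct request *r)
{
    assert(buf != NULL && r != NULL);
    r->timestamp = read_be32(buf);
    r->clientID  = read_be32(buf + 4);
    r->objectID  = read_be32(buf + 8);
    r->size      = read_be32(buf + 12);
    r->method    = buf[16];
    r->status    = buf[17];
    r->type      = buf[18];
    r->server    = buf[19];
}

/* Reads the next record from an already decompressed stream.
 * Returns 1 on success, 0 on end of file or a short read. */
int read_request(FILE *in, struct request *r)
{
    unsigned char buf[RECORD_SIZE];

    if (fread(buf, 1, RECORD_SIZE, in) != RECORD_SIZE)
        return 0;
    decode_request(buf, r);
    return 1;
}
```

A caller would typically open the decompressed log in binary mode ("rb") and call read_request in a loop until it returns 0.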
The fields of the request structure contain the following information:
timestamp - the time of the request, stored as the number of seconds since the Epoch. The timestamp has been converted to GMT to allow for portability. During the World Cup the local time was 2 hours ahead of GMT (+0200). In order to determine the local time, each timestamp must be adjusted by this amount.
clientID - a unique integer identifier for the client that issued the request (this may be a proxy). Due to privacy concerns, the mappings from clientID to IP address cannot be released. Each clientID maps to exactly one IP address, and the mappings are preserved across the entire data set: if IP address 0.0.0.0 mapped to clientID X on day Y, then every request with clientID X in any of the data sets also came from IP address 0.0.0.0.
objectID - a unique integer identifier for the requested URL; these mappings are also 1-to-1 and are preserved across the entire data set
size - the number of bytes in the response
method - the method contained in the client's request (e.g., GET).
status - this field contains two pieces of information; the 2 highest order bits contain the HTTP version indicated in the client's request (e.g., HTTP/1.0); the remaining 6 bits indicate the response status code (e.g., 200 OK).
type - the type of file requested (e.g., HTML, IMAGE, etc.), generally based on the file extension (.html) or the presence of a parameter list (e.g., '?' indicates a DYNAMIC request). If the URL ends with '/', it is considered a DIRECTORY.
server - indicates which server handled the request. The upper 3 bits indicate the region in which the server was located (e.g., SANTA CLARA, PLANO, HERNDON, PARIS); the remaining 5 bits indicate which server at that site handled the request. All 8 bits together can also be used to identify a unique server.
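The packed status and server fields, and the GMT timestamp, can be unpacked with a few shifts and masks. The helpers below are a minimal sketch with hypothetical names. They return only the raw numeric codes, because this description does not give the tables that map those codes to names such as GET, HTTP/1.0, 200 OK or PARIS.

```c
#include <assert.h>
#include <stdint.h>

/* Two highest-order bits of status: code for the HTTP version. */
uint8_t status_http_version(uint8_t status)
{
    return (uint8_t)(status >> 6);
}

/* Remaining six bits of status: code for the response status. */
uint8_t status_code(uint8_t status)
{
    return (uint8_t)(status & 0x3F);
}

/* Upper three bits of server: code for the region. */
uint8_t server_region(uint8_t server)
{
    return (uint8_t)(server >> 5);
}

/* Lower five bits of server: server number within that region. */
uint8_t server_number(uint8_t server)
{
    return (uint8_t)(server & 0x1F);
}

/* Shifts a GMT timestamp to World Cup local time (GMT +0200). */
uint32_t local_timestamp(uint32_t gmt_seconds)
{
    /* The data set dates from 1998, so the addition cannot overflow. */
    assert(gmt_seconds <= UINT32_MAX - 2u * 3600u);
    return gmt_seconds + 2u * 3600u;
}
```

Returning raw codes keeps the sketch faithful to what is documented here; a lookup table from codes to names could be layered on top once the mapping is known.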
Reference
[1] M. Arlitt and T. Jin, "1998 World Cup Web Site Access Logs", August 1998.