Published January 11, 2024 | Version 0.0.1
Model | Open Access

A GPT-2 Model Trained Solely on English Wikipedia for Novelty Assessment

  • Indiana University Bloomington

Description

Model Architecture

GPT-2 124M Variant

  • Number of Layers (n_layer): 12
  • Number of Attention Heads (n_head): 12
  • Embedding Size (n_embd): 768
  • Total Parameters: Approximately 124 million
  • Dropout Rate: 0.0 (no dropout applied during training)
  • Bias in LayerNorm and Linear Layers: Not used (bias = False)

Refer to Radford et al. (2019) for details.
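
For reference, the hyperparameters above can be collected into a nanoGPT-style configuration dataclass. The sketch below is illustrative only; the class and field names are assumptions and are not taken verbatim from the noveval training code.

```python
# Illustrative GPT-2 124M configuration mirroring the values listed above.
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024    # context length (see Batch and Block Configuration)
    vocab_size: int = 50304   # padded GPT-2 vocabulary size
    n_layer: int = 12         # transformer layers
    n_head: int = 12          # attention heads per layer
    n_embd: int = 768         # embedding / hidden size
    dropout: float = 0.0      # no dropout applied during training
    bias: bool = False        # no bias in LayerNorm and Linear layers
```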

Training Scheme

Dataset

  • The model is trained on the wikipedia_en dataset of English Wikipedia articles (dump cutoff March 1, 2022), loaded from the Hugging Face datasets library via datasets.load_dataset("wikipedia", "20220301.en"), as sketched below.
  • The datasets version used is 2.9.0.
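
A minimal loading sketch for the dump named above; the column access assumes the standard fields of the Hugging Face Wikipedia dataset (id, url, title, text).

```python
# Load the March 1, 2022 English Wikipedia dump (datasets==2.9.0).
from datasets import load_dataset

wiki = load_dataset("wikipedia", "20220301.en")
print(wiki["train"].num_rows)          # number of articles
print(wiki["train"][0]["text"][:200])  # first 200 characters of one article
```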

Tokenizer

  • Tokenization uses tiktoken.get_encoding("gpt2"), which retrieves the GPT-2 byte-pair-encoding tokenizer (50,257 tokens); the model's vocabulary size is padded to 50,304. A usage sketch follows this list.
  • The end-of-text token (EOT), enc.eot_token (50,256 for GPT-2 BPE), is appended to each sequence of token IDs.
  • The tiktoken version used is 0.2.0.
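
A minimal tokenization sketch consistent with the description above; the helper function name and the use of encode_ordinary are assumptions rather than the repository's exact preprocessing code.

```python
# GPT-2 BPE tokenization with an appended end-of-text token (tiktoken==0.2.0).
import tiktoken

enc = tiktoken.get_encoding("gpt2")

def tokenize(text: str) -> list[int]:
    ids = enc.encode_ordinary(text)  # BPE token IDs, no special tokens
    ids.append(enc.eot_token)        # append <|endoftext|>, ID 50256
    return ids

print(tokenize("Hello, Wikipedia!")[-1])  # 50256
```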

Batch and Block Configuration

  • Batch Size: 16 (micro-batch size)
  • Block Size: 1,024 tokens
  • Gradient Accumulation Steps: 5
  • Effective Total Batch Size: 327,680 tokens per optimizer step (16 × 1,024 × 5 gradient-accumulation steps × 4 A100 GPUs); see the sketch below.
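
The sketch below shows one common way, in the nanoGPT style, to draw micro-batches of block_size tokens from a flat token array; the function is illustrative and is not the repository's actual data loader.

```python
# Sample a micro-batch of 16 sequences of 1,024 tokens each, with next-token targets.
import numpy as np
import torch

def get_batch(token_ids: np.ndarray, batch_size: int = 16, block_size: int = 1024):
    ix = torch.randint(len(token_ids) - block_size, (batch_size,))
    x = torch.stack([torch.from_numpy(token_ids[i:i + block_size].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(token_ids[i + 1:i + 1 + block_size].astype(np.int64)) for i in ix])
    return x, y  # inputs and shifted targets

# Per optimizer step: 16 * 1024 tokens per micro-batch, accumulated over
# 5 steps on each of 4 A100s = 327,680 tokens.
```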

Optimizer (AdamW)

  • Initial Learning Rate: 6e-4
  • Final Minimum Learning Rate: 6e-5
  • Weight Decay: 1e-1
  • Beta1: 0.9
  • Beta2: 0.95
  • Gradient Clipping Value: 1.0
  • Epsilon (eps): 1e-5
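
A minimal PyTorch sketch of the optimizer settings above; it uses a stand-in module and omits the per-parameter weight-decay grouping that GPT training scripts typically apply.

```python
# AdamW with the hyperparameters listed above, plus gradient clipping at 1.0.
import torch

model = torch.nn.Linear(768, 768, bias=False)  # stand-in for the GPT-2 module

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=6e-4,             # initial learning rate
    betas=(0.9, 0.95),   # beta1, beta2
    eps=1e-5,
    weight_decay=1e-1,
)

# Applied after backward(), before optimizer.step():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```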

Learning Rate Decay

  • Warmup Iterations: 2,000
  • Decay Iterations: 141,000 (aligned with max_iters)
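
A cosine-decay-with-warmup schedule consistent with the numbers above, written in the style of nanoGPT's get_lr helper; this is a sketch, not the exact scheduler code used in training.

```python
# Linear warmup for 2,000 iterations, then cosine decay to 6e-5 by 141,000.
import math

warmup_iters = 2_000
lr_decay_iters = 141_000
learning_rate = 6e-4
min_lr = 6e-5

def get_lr(it: int) -> float:
    if it < warmup_iters:                    # linear warmup
        return learning_rate * it / warmup_iters
    if it > lr_decay_iters:                  # past max_iters: hold at the floor
        return min_lr
    ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))  # decays from 1 to 0
    return min_lr + coeff * (learning_rate - min_lr)

print(get_lr(0), get_lr(2_000), get_lr(141_000))  # 0.0, 6e-4, 6e-5
```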

Training Iterations

  • Maximum Iterations (max_iters): 141,000
  • Approximate Number of Epochs: 10 (totaling around 46 billion tokens)
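
The token budget implied by these settings can be checked directly; the corpus-size figure in the final comment is an inference from the ~10-epoch statement above.

```python
# Total training tokens: tokens per step times the number of optimizer steps.
tokens_per_step = 16 * 1024 * 5 * 4         # 327,680 (see Batch and Block Configuration)
max_iters = 141_000
total_tokens = tokens_per_step * max_iters
print(f"{total_tokens / 1e9:.1f}B tokens")  # ~46.2B, i.e. ~10 passes over a ~4.6B-token corpus
```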

Evaluation Scheme

  • Evaluation Interval: Every 1,000 iterations
  • Number of Evaluation Iterations: 200
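
A sketch of the evaluation loop implied by these settings (loss averaged over 200 batches, run every 1,000 iterations), in the style of nanoGPT's estimate_loss; the model and get_batch interfaces here are assumptions.

```python
# Average validation loss over 200 batches; intended to run every 1,000 iterations.
import torch

eval_interval = 1_000
eval_iters = 200

@torch.no_grad()
def estimate_loss(model, get_batch) -> float:
    model.eval()
    losses = torch.zeros(eval_iters)
    for k in range(eval_iters):
        x, y = get_batch("val")   # assumed helper returning (inputs, targets)
        _, loss = model(x, y)     # assumed (logits, loss) return signature
        losses[k] = loss.item()
    model.train()
    return losses.mean().item()
```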

Other Dependencies

For details on other dependencies, refer to requirements.txt at https://github.com/Wang-Haining/noveval.

References

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9.

Files

One file, 1.5 GB (md5:150e55be451f7a570b1bdcc3a2cd7c29).

Additional details

Related works

Is part of
Preprint: arXiv:2401.03642 (arXiv)
