Published January 11, 2024 | Version 0.0.1 | Model | Open
A GPT-2 Model Trained Solely on English Wikipedia for Novelty Assessment
Description
Model Architecture
GPT-2 124M Variant
- Number of Layers (n_layer): 12
- Number of Attention Heads (n_head): 12
- Embedding Size (n_embd): 768
- Total Parameters: Approximately 124 million
- Dropout Rate: 0.0 (no dropout applied during training)
- Bias in LayerNorm and Linear Layers: Not used (bias = False)
Refer to Radford et al. (2019) for details.
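For reference, these hyperparameters can be gathered into a nanoGPT-style configuration dataclass. This is an illustrative sketch: the `GPTConfig` name and field names are assumptions mirroring common GPT-2 reproductions, not necessarily the repository's exact code.

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024   # context length in tokens
    vocab_size: int = 50304  # GPT-2 BPE vocab (50,257) padded up for efficiency
    n_layer: int = 12        # transformer blocks
    n_head: int = 12         # attention heads per block
    n_embd: int = 768        # embedding dimension
    dropout: float = 0.0     # no dropout applied during training
    bias: bool = False       # no bias in LayerNorm / Linear layers
```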
Training Scheme
Dataset
- The model is trained on the `wikipedia` dataset (English configuration `20220301.en`), which consists of English Wikipedia articles with a data cutoff of March 1, 2022, sourced from the Hugging Face `datasets` library (`datasets.load_dataset("wikipedia", "20220301.en")`).
- The version of `datasets` used is 2.9.0.
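A minimal loading example, using the pinned `datasets==2.9.0` and the preprocessed `20220301.en` configuration named above:

```python
from datasets import load_dataset  # datasets==2.9.0

# Preprocessed English Wikipedia dump with a March 1, 2022 cutoff.
wiki = load_dataset("wikipedia", "20220301.en")

print(wiki["train"].num_rows)          # number of articles
print(wiki["train"][0]["text"][:200])  # preview one article
```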
Tokenizer
- The specific encoding function used is `tiktoken.get_encoding("gpt2")`, which retrieves the GPT-2 BPE tokenizer. The tokenizer's vocabulary contains 50,257 tokens; the model's vocabulary size of 50,304 is that figure padded up.
- An end-of-text (EOT) token, `enc.eot_token` (50,256 for GPT-2 BPE), is appended to each sequence of token IDs.
- The version of `tiktoken` used is 0.2.0.
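A sketch of the tokenization step under these settings; the use of `encode_ordinary` (which encodes text while ignoring special tokens) is an assumption about the exact call, common in GPT-2 data-preparation scripts:

```python
import tiktoken  # tiktoken==0.2.0

enc = tiktoken.get_encoding("gpt2")

text = "An example Wikipedia paragraph."
ids = enc.encode_ordinary(text)  # BPE token IDs, special tokens ignored
ids.append(enc.eot_token)        # append EOT (50256) as the document boundary
```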
Batch and Block Configuration
- Batch Size: 16 (micro-batch size)
- Block Size: 1,024 tokens
- Gradient Accumulation Steps: 5
- Effective Total Batch Size: 327,680 tokens per optimizer step (16 × 1,024 × 5 × 4 A100 GPUs)
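The effective batch size follows directly from these settings:

```python
micro_batch_size = 16   # sequences per GPU per micro-step
block_size = 1024       # tokens per sequence
grad_accum_steps = 5
num_gpus = 4            # A100s

tokens_per_step = micro_batch_size * block_size * grad_accum_steps * num_gpus
print(tokens_per_step)  # 327680 tokens per optimizer step
```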
Optimizer (AdamW)
- Initial Learning Rate: 6e-4
- Final Minimum Learning Rate: 6e-5
- Weight Decay: 1e-1
- Beta1: 0.9
- Beta2: 0.95
- Gradient Clipping Value: 1.0
- Epsilon (eps): 1e-5
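In PyTorch these values correspond to roughly the following. This is a minimal sketch, assuming `model` is the GPT-2 module described above; the actual training code may additionally split parameters into decay / no-decay groups before constructing the optimizer.

```python
import torch

optimizer = torch.optim.AdamW(
    model.parameters(),   # `model` assumed to be the GPT-2 module above
    lr=6e-4,              # initial learning rate
    betas=(0.9, 0.95),
    eps=1e-5,
    weight_decay=0.1,
)

# Applied every step, after backward() and before optimizer.step():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```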
Learning Rate Decay
- Warmup Iterations: 2,000
- Decay Iterations: 141,000 (aligned with max_iters)
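The decay shape is not stated above; assuming the linear-warmup plus cosine-decay schedule common in GPT-2 reproductions, the per-iteration learning rate would look like this sketch:

```python
import math

def get_lr(it, max_lr=6e-4, min_lr=6e-5, warmup_iters=2_000, decay_iters=141_000):
    """Linear warmup to max_lr, then cosine decay to min_lr (assumed shape)."""
    if it < warmup_iters:               # linear warmup phase
        return max_lr * it / warmup_iters
    if it > decay_iters:                # past the decay window
        return min_lr
    ratio = (it - warmup_iters) / (decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))  # goes 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)
```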
Training Iterations
- Maximum Iterations (max_iters): 141,000
- Approximate Number of Epochs: 10 (totaling around 46 billion tokens)
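These counts are consistent with the effective batch size above:

```python
tokens_per_iter = 327_680   # from the batch configuration above
max_iters = 141_000

total_tokens = tokens_per_iter * max_iters
print(f"{total_tokens / 1e9:.1f}B tokens")  # ~46.2B tokens, i.e. ~10 passes
                                            # over roughly 4.6B Wikipedia tokens
```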
Evaluation Scheme
- Evaluation Interval: Every 1,000 iterations
- Number of Evaluation Iterations: 200
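A sketch of the evaluation loop implied by these settings, in the nanoGPT style; `get_batch` is a hypothetical data-loading helper, and the model is assumed to return a `(logits, loss)` pair:

```python
import torch

@torch.no_grad()
def estimate_loss(model, get_batch, eval_iters=200):
    """Average loss over eval_iters batches, run every 1,000 training iterations."""
    model.eval()
    losses = torch.zeros(eval_iters)
    for k in range(eval_iters):
        X, Y = get_batch("val")   # hypothetical helper returning inputs, targets
        _, loss = model(X, Y)     # model assumed to return (logits, loss)
        losses[k] = loss.item()
    model.train()
    return losses.mean().item()
```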
Other Dependencies
For further details on other dependencies, refer to requirements.txt at https://github.com/Wang-Haining/noveval.
References
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9.
Files (1.5 GB)

| Name | Size |
|---|---|
| md5:150e55be451f7a570b1bdcc3a2cd7c29 | 1.5 GB |
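After downloading, the file can be verified against the published checksum; the filename below is a placeholder for whatever name the download is saved under:

```python
import hashlib

def md5sum(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while block := f.read(chunk_size):
            h.update(block)
    return h.hexdigest()

# "model.ckpt" is a placeholder name for the downloaded file.
assert md5sum("model.ckpt") == "150e55be451f7a570b1bdcc3a2cd7c29"
```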
Additional details
Identifiers
- arXiv: arXiv:2401.03642

Related works
- Is part of: Preprint arXiv:2401.03642 (arXiv)
References
- Wang, H. (2024). A content-based novelty measure for scholarly publications: A proof of concept. arXiv. https://arxiv.org/abs/2401.03642