Published January 11, 2024 | Version 0.0.1 | Model | Open
A GPT-2 Model Trained Solely on English Wikipedia for Novelty Assessment
Description
Model Architecture
GPT-2 124M Variant
- Number of Layers (n_layer): 12
- Number of Attention Heads (n_head): 12
- Embedding Size (n_embd): 768
- Total Parameters: Approximately 124 million
- Dropout Rate: 0.0 (no dropout applied during training)
- Bias in LayerNorm and Linear Layers: Not used (bias = False)
Refer to Radford et al. (2019) for details.
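For reference, these hyperparameters can be gathered into a nanoGPT-style configuration dataclass. This is an illustrative sketch: the `GPTConfig` name and field names are assumptions mirroring common GPT-2 reproductions, not necessarily the repository's exact code.

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024   # context length in tokens
    vocab_size: int = 50304  # GPT-2 BPE vocab (50,257) padded up for efficiency
    n_layer: int = 12        # transformer blocks
    n_head: int = 12         # attention heads per block
    n_embd: int = 768        # embedding dimension
    dropout: float = 0.0     # no dropout applied during training
    bias: bool = False       # no bias in LayerNorm / Linear layers
```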
Training Scheme
Dataset
- The model is trained on the `wikipedia` dataset (English configuration `20220301.en`), which consists of English Wikipedia articles with a data cutoff of March 1, 2022, sourced from the Hugging Face `datasets` library (`datasets.load_dataset("wikipedia", "20220301.en")`).
- The version of `datasets` used is 2.9.0.
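A minimal loading example, using the pinned `datasets==2.9.0` and the preprocessed `20220301.en` configuration named above:

```python
from datasets import load_dataset  # datasets==2.9.0

# Preprocessed English Wikipedia dump with a March 1, 2022 cutoff.
wiki = load_dataset("wikipedia", "20220301.en")

print(wiki["train"].num_rows)          # number of articles
print(wiki["train"][0]["text"][:200])  # preview one article
```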
Tokenizer
- The specific encoding function used is `tiktoken.get_encoding("gpt2")`, which retrieves the GPT-2 BPE tokenizer. The tokenizer's vocabulary contains 50,257 tokens; the model's vocabulary size of 50,304 is that figure padded up.
- An end-of-text (EOT) token, `enc.eot_token` (50,256 for GPT-2 BPE), is appended to each sequence of token IDs.
- The version of `tiktoken` used is 0.2.0.
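A sketch of the tokenization step under these settings; the use of `encode_ordinary` (which encodes text while ignoring special tokens) is an assumption about the exact call, common in GPT-2 data-preparation scripts:

```python
import tiktoken  # tiktoken==0.2.0

enc = tiktoken.get_encoding("gpt2")

text = "An example Wikipedia paragraph."
ids = enc.encode_ordinary(text)  # BPE token IDs, special tokens ignored
ids.append(enc.eot_token)        # append EOT (50256) as the document boundary
```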
Batch and Block Configuration
- Batch Size: 16 (micro-batch size)
- Block Size: 1,024 tokens
- Gradient Accumulation Steps: 5
- Effective Total Batch Size: 327,680 tokens per optimizer step (16 × 1,024 × 5 × 4 A100 GPUs)
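The effective batch size follows directly from these settings:

```python
micro_batch_size = 16   # sequences per GPU per micro-step
block_size = 1024       # tokens per sequence
grad_accum_steps = 5
num_gpus = 4            # A100s

tokens_per_step = micro_batch_size * block_size * grad_accum_steps * num_gpus
print(tokens_per_step)  # 327680 tokens per optimizer step
```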
Optimizer (AdamW)
- Initial Learning Rate: 6e-4
- Final Minimum Learning Rate: 6e-5
- Weight Decay: 1e-1
- Beta1: 0.9
- Beta2: 0.95
- Gradient Clipping Value: 1.0
- Epsilon (eps): 1e-5
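In PyTorch these values correspond to roughly the following. This is a minimal sketch, assuming `model` is the GPT-2 module described above; the actual training code may additionally split parameters into decay / no-decay groups before constructing the optimizer.

```python
import torch

optimizer = torch.optim.AdamW(
    model.parameters(),   # `model` assumed to be the GPT-2 module above
    lr=6e-4,              # initial learning rate
    betas=(0.9, 0.95),
    eps=1e-5,
    weight_decay=0.1,
)

# Applied every step, after backward() and before optimizer.step():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```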
Learning Rate Decay
- Warmup Iterations: 2,000
- Decay Iterations: 141,000 (aligned with max_iters)
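The decay shape is not stated above; assuming the linear-warmup plus cosine-decay schedule common in GPT-2 reproductions, the per-iteration learning rate would look like this sketch:

```python
import math

def get_lr(it, max_lr=6e-4, min_lr=6e-5, warmup_iters=2_000, decay_iters=141_000):
    """Linear warmup to max_lr, then cosine decay to min_lr (assumed shape)."""
    if it < warmup_iters:               # linear warmup phase
        return max_lr * it / warmup_iters
    if it > decay_iters:                # past the decay window
        return min_lr
    ratio = (it - warmup_iters) / (decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))  # goes 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)
```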
Training Iterations
- Maximum Iterations (max_iters): 141,000
- Approximate Number of Epochs: 10 (totaling around 46 billion tokens)
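These counts are consistent with the effective batch size above:

```python
tokens_per_iter = 327_680   # from the batch configuration above
max_iters = 141_000

total_tokens = tokens_per_iter * max_iters
print(f"{total_tokens / 1e9:.1f}B tokens")  # ~46.2B tokens, i.e. ~10 passes
                                            # over roughly 4.6B Wikipedia tokens
```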
Evaluation Scheme
- Evaluation Interval: Every 1,000 iterations
- Number of Evaluation Iterations: 200
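A sketch of the evaluation loop implied by these settings, in the nanoGPT style; `get_batch` is a hypothetical data-loading helper, and the model is assumed to return a `(logits, loss)` pair:

```python
import torch

@torch.no_grad()
def estimate_loss(model, get_batch, eval_iters=200):
    """Average loss over eval_iters batches, run every 1,000 training iterations."""
    model.eval()
    losses = torch.zeros(eval_iters)
    for k in range(eval_iters):
        X, Y = get_batch("val")   # hypothetical helper returning inputs, targets
        _, loss = model(X, Y)     # model assumed to return (logits, loss)
        losses[k] = loss.item()
    model.train()
    return losses.mean().item()
```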
Other Dependencies
For further details on other dependencies, refer to requirements.txt at https://github.com/Wang-Haining/noveval.
References
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9.
Files (1.5 GB)

| Name | Size |
|---|---|
| md5:150e55be451f7a570b1bdcc3a2cd7c29 | 1.5 GB |
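After downloading, the file can be verified against the published checksum; the filename below is a placeholder for whatever name the download is saved under:

```python
import hashlib

def md5sum(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while block := f.read(chunk_size):
            h.update(block)
    return h.hexdigest()

# "model.ckpt" is a placeholder name for the downloaded file.
assert md5sum("model.ckpt") == "150e55be451f7a570b1bdcc3a2cd7c29"
```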
Additional details
Identifiers
- arXiv: arXiv:2401.03642

Related works
- Is part of: Preprint arXiv:2401.03642 (arXiv)
References
- Wang, H. (2024). A content-based novelty measure for scholarly publications: A proof of concept. arXiv. https://arxiv.org/abs/2401.03642