# Artifact Description

This model is a version of the gpt-j-6B-smart-contract model, fine-tuned on the [vulnerable_smart_contracts](https://doi.org/10.6084/m9.figshare.21990287) dataset. Its total size is 24.3 GB, split into two shards of around 12 GB each. It was trained with the Transformers library and is available in PyTorch format.

# Environment Setup

The [Transformers](https://github.com/huggingface/transformers) library from HuggingFace is required to load the model. Depending on the system you are using, you might need to install PyTorch from source. See [here](https://pytorch.org/get-started/locally/) for instructions. Both Unix-based and Windows systems are supported.

Loading the model in float32 precision requires at least twice the model size in RAM: one copy for the initial weights and another for the loaded checkpoint. Loading alone therefore takes at least 48 GB of RAM. Inference on a GPU needs around 40 GB of GPU memory to hold the model. Training or fine-tuning requires significantly more GPU memory.

# Getting Started

The following code snippets demonstrate how to run inference with the model using the Transformers library from HuggingFace.

First, the tokenizer and model are loaded into memory. The path supplied to the tokenizer and model must be a valid directory containing a `config.json` file. This is the path to the directory extracted from the downloaded "model.zip" file. After the model is loaded into RAM, it is moved onto the GPU if a CUDA GPU is available.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Use the GPU when one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("path/to/tokenizer/dir")
tokenizer.pad_token = tokenizer.eos_token

# Load model
model = AutoModelForCausalLM.from_pretrained("path/to/model/dir").to(device)
print("Model loaded")
```

To activate vulnerability-constrained decoding, the `NoBadWordsLogitsProcessor` logits processor in the Transformers library can be used by defining a list of lists of token ids that must not be generated. These must be the token ids of the vulnerability tokens to avoid.

```python
# Concatenate the vulnerability tokens and look up their token ids
bad_words = ''.join(['','','','','','','','','',''])
bad_word_ids = tokenizer(bad_words).input_ids
# Wrap each id in its own list so that every token is blocked individually
bad_word_ids_list = [[token_id] for token_id in bad_word_ids]
```

Then, some sample smart contract code is encoded with the initialized tokenizer and placed on the GPU (if available).

```python
prompt = """// SPDX-License-Identifier: GPL-3.0
pragma solidity >= 0.7.0;

contract Coin {
    // Sends an amount of newly created coins to an address
    // Can only be called by the contract creator
    function mint(address receiver, uint amount) public {
        require(msg.sender == minter);
        require(amount < 1e60);
        balances[receiver] += amount;
    }

    // Sends an amount of existing coins
    // from any caller to an address"""

# Tokenize
encodings = tokenizer(prompt, padding=True, return_tensors="pt").to(device)
```

Finally, the encoded text is passed to the model together with `bad_word_ids_list`. Blocking the vulnerability tokens steers the model toward generating secure code for the smart contract sample. When generation finishes, the output is decoded with the tokenizer and printed.

```python
# Generate
with torch.no_grad():
    outputs = model.generate(
        **encodings,
        max_length=256,
        pad_token_id=tokenizer.eos_token_id,
        bad_words_ids=bad_word_ids_list,
    )

# Decode the first (and only) generated sequence, keeping special tokens
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=False)
print(generated_text)
```

To deactivate vulnerability-constrained decoding, omit the `bad_words_ids` argument from the call to `generate`.
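
When switching between the two modes often, it can help to assemble the `generate` arguments in one place. The following is a minimal sketch, not part of the original artifact: `build_generate_kwargs` is a hypothetical helper name, and the token ids used in the example calls are illustrative placeholders rather than the real vulnerability token ids.

```python
from typing import Any, Dict, List, Optional


def build_generate_kwargs(
    pad_token_id: int,
    bad_word_ids_list: Optional[List[List[int]]] = None,
    max_length: int = 256,
) -> Dict[str, Any]:
    """Assemble keyword arguments for model.generate().

    Hypothetical helper: the bad_words_ids key is only included when a
    non-empty list is supplied, so leaving it out (or passing an empty
    list) yields unconstrained decoding.
    """
    kwargs: Dict[str, Any] = {
        "max_length": max_length,
        "pad_token_id": pad_token_id,
    }
    if bad_word_ids_list:
        kwargs["bad_words_ids"] = bad_word_ids_list
    return kwargs


# Constrained decoding: placeholder ids stand in for the vulnerability tokens
constrained = build_generate_kwargs(
    pad_token_id=0, bad_word_ids_list=[[101], [102]]
)

# Unconstrained decoding: no bad_words_ids key is produced at all
unconstrained = build_generate_kwargs(pad_token_id=0)
```

With the objects from the snippets above, the result would be used as `model.generate(**encodings, **build_generate_kwargs(tokenizer.eos_token_id, bad_word_ids_list))`.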