# Artifact Description
This model is a version of gpt-j-6B-smart-contract fine-tuned on the [vulnerable_smart_contracts](https://doi.org/10.6084/m9.figshare.21990287) dataset. The checkpoint totals 24.3 GB, split into two shards of roughly 12 GB each. It was trained with the Transformers library and is provided in PyTorch format.

# Environment Setup
The [Transformers](https://github.com/huggingface/transformers) library from HuggingFace is required to load the model. Depending on your system, you might need to install PyTorch from source; see the [PyTorch installation instructions](https://pytorch.org/get-started/locally/). Both Unix-based and Windows systems are supported.

Loading the model in float32 precision requires at least twice the model size in RAM: one copy for the initial weights and one for the loaded checkpoint. Loading alone therefore takes at least 48 GB of RAM. Inference on a GPU needs around 40 GB of GPU memory to hold the model. Training or fine-tuning requires significantly more GPU memory.
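
The RAM figure above can be reproduced with simple arithmetic. The sketch below is only an estimate: it assumes 1 GB = 10^9 bytes and 4 bytes per float32 parameter, and the helper names `load_ram_gb` and `approx_param_count` are hypothetical, introduced here for illustration.

```python
# Rough memory arithmetic for loading the checkpoint (illustrative only).
CHECKPOINT_GB = 24.3  # total checkpoint size stated in the artifact description
BYTES_PER_PARAM = {"float32": 4, "float16": 2}  # standard sizes of these dtypes


def load_ram_gb(checkpoint_gb: float, copies: int = 2) -> float:
    """RAM needed when `copies` full copies of the weights are held at once."""
    return checkpoint_gb * copies


def approx_param_count(checkpoint_gb: float, dtype: str = "float32") -> float:
    """Approximate parameter count implied by the checkpoint size (1 GB = 1e9 bytes)."""
    return checkpoint_gb * 1e9 / BYTES_PER_PARAM[dtype]


print(f"RAM to load: about {load_ram_gb(CHECKPOINT_GB):.1f} GB")
print(f"Implied parameters: about {approx_param_count(CHECKPOINT_GB) / 1e9:.2f} billion")
```

The result is consistent with the "at least 48 GB" figure; actual usage also depends on the framework and the operating system.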

# Getting Started
The following code snippets demonstrate how to do inference with the model using the transformers library from HuggingFace.

First, the tokenizer and model need to be loaded into memory. The path supplied to the tokenizer and model must be a valid directory containing a `config.json` file, namely the directory extracted from the downloaded `model.zip` file. After the model is loaded into RAM, it is moved onto the GPU if a CUDA GPU is available.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("path/to/tokenizer/dir")
tokenizer.pad_token = tokenizer.eos_token

# Load model
model = AutoModelForCausalLM.from_pretrained("path/to/model/dir").to(device)
print("Model loaded")
```

To activate vulnerability-constrained decoding, the `NoBadWordsLogitsProcessor` logits processor in the Transformers library can be used by defining a list of lists of token ids that must not be generated. These must be the token ids of the vulnerability tokens we want to avoid.

```python
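# Vulnerability label tokens that must not be generated
bad_words = ''.join(['<UpS>','<TO>','<IOU>','<DC>','<UcC>','<RE>','<FE>','<NC>','<TD>','<TOD>'])
# Tokenize the concatenated labels (assumes each label maps to a single token id)
bad_word_ids = tokenizer(bad_words).input_ids
# Wrap each id in its own list: one banned sequence per vulnerability token
bad_word_ids_list = [[token_id] for token_id in bad_word_ids]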
```
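
The snippet above relies on every vulnerability label being encoded as exactly one token id. A minimal sketch of a check for that assumption follows. The helper `single_token_bad_word_ids` and the stand-in vocabulary with its ids are hypothetical, used here only so the example runs without the model; with the real tokenizer one would pass something like `lambda tag: tokenizer(tag).input_ids` as `encode`.

```python
from typing import Callable, Iterable, List


def single_token_bad_word_ids(
    encode: Callable[[str], List[int]], tags: Iterable[str]
) -> List[List[int]]:
    """Encode each tag separately and require exactly one token id per tag."""
    banned = []
    for tag in tags:
        token_ids = encode(tag)
        if len(token_ids) != 1:
            raise ValueError(f"{tag!r} maps to {len(token_ids)} tokens, expected 1")
        banned.append(token_ids)
    return banned


# Stand-in encoder with made-up ids (NOT the real vocabulary of this model).
fake_vocab = {"<RE>": 50400, "<TD>": 50401}


def fake_encode(tag: str) -> List[int]:
    # Unknown strings split into several pieces, as a real tokenizer might do.
    return [fake_vocab[tag]] if tag in fake_vocab else [1, 2]


print(single_token_bad_word_ids(fake_encode, ["<RE>", "<TD>"]))
```

If a label were split into several tokens, the check raises an error instead of silently banning the individual pieces.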

Then, some sample smart contract code is encoded with the initialized tokenizer and placed on the GPU (if available).

```python
prompt = """// SPDX-License-Identifier: GPL-3.0
pragma solidity >= 0.7.0;

contract Coin {
    // Sends an amount of newly created coins to an address
    // Can only be called by the contract creator
    function mint(address receiver, uint amount) public {
        require(msg.sender == minter);
        require(amount < 1e60);
        balances[receiver] += amount;
    }

    // Sends an amount of existing coins
    // from any caller to an address"""

# Tokenize
encodings = tokenizer(prompt, padding=True, return_tensors="pt").to(device)
```

Finally, the encoded text is fed to the model as input, along with the `bad_word_ids_list`. This prevents the model from emitting the vulnerability tokens, steering generation toward secure code for the smart contract sample. When generation finishes, the output is decoded with the tokenizer and printed.

```python
# Generate
with torch.no_grad():
    outputs = model.generate(
        **encodings,
        max_length=256,  # total length: prompt tokens plus generated tokens
        pad_token_id=tokenizer.eos_token_id,
        bad_words_ids=bad_word_ids_list,
    )
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=False)
print(generated_text)
```

To deactivate vulnerability-constrained decoding, omit the `bad_words_ids` argument from the `generate` call.
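
For intuition about what the constraint changes, banning a single-token id conceptually amounts to setting its score to negative infinity before the next token is chosen. The following pure-Python sketch illustrates that idea on a toy score vector. It is not the Transformers implementation, and `mask_banned` and `greedy_pick` are hypothetical helper names.

```python
import math
from typing import List, Sequence


def mask_banned(scores: Sequence[float], banned_ids: Sequence[Sequence[int]]) -> List[float]:
    """Return a copy of `scores` with every single-token banned id set to -inf."""
    masked = list(scores)
    for sequence in banned_ids:
        if len(sequence) == 1:  # only single-token bans are handled in this sketch
            masked[sequence[0]] = -math.inf
    return masked


def greedy_pick(scores: Sequence[float]) -> int:
    """Index of the highest score, i.e. the greedy next-token choice."""
    return max(range(len(scores)), key=scores.__getitem__)


toy_scores = [0.1, 2.5, 1.7, 0.3]  # toy next-token scores over a 4-token vocabulary
banned = [[1]]                     # ban token id 1, in the same shape as bad_words_ids

print(greedy_pick(toy_scores))                       # unconstrained choice: 1
print(greedy_pick(mask_banned(toy_scores, banned)))  # constrained choice: 2
```

Without the mask the highest-scoring token is selected even if it is a vulnerability token; with the mask the next-best allowed token is selected instead.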
