Published December 9, 2023 | Version v1
Computational notebook Open

Text-to-text Generation for Issue Report Classification

  • 1. TCS Research

Description

Submission for the NLBSE Issue Report Tool Competition 

This package accompanies the submission titled "Text-to-text Generation for Issue Report Classification" to the NLBSE Issue Report Tool Competition. The package provides resources for replicating the experiments and results presented.

Description of ZIP Files:

  1. issue_classification_t5: This archive contains the code for replicating the study, including the retrieval of the pre-trained model, fine-tuning procedures, and inference execution.
    1. code: Contains all the code files.
      1. finetuning.py: Code for fine-tuning the VMware/flan-t5-large-alpaca model on the issue report classification task. Comments embedded in the file explain how to run the fine-tuning process.
      2. inference.py: Code for running inference with the fine-tuned model. As with the fine-tuning script, instructions for running inference are embedded as comments in the file.
      3. download_plm.py: Code for downloading VMware/flan-t5-large-alpaca from https://huggingface.co/VMware/flan-t5-large-alpaca.
      4. requirements.txt: Lists the Python modules, with their versions, required to run the provided code.
    2. data: Contains the NLBSE issue report classification data and the model output obtained by running inference.py on issue-report-test.csv.
      1. checkpoint-3000-output.csv: Output obtained by fine-tuning the VMware/flan-t5-large-alpaca model for 2 epochs (F1-score of 0.8297) on issue-report-train.csv and then running inference on issue-report-test.csv. The 'label' column contains the ground-truth labels. The 'Model generated output' column contains the labels predicted by the model.
      2. issue-report-train.csv: NLBSE24 issue report tool competition training dataset. (Source: https://github.com/nlbse2024/issue-report-classification)
      3. issue-report-test.csv: NLBSE24 issue report tool competition test dataset. (Source: https://github.com/nlbse2024/issue-report-classification)
  2. finetuned_model_checkpoint-3000: This archive contains the VMware/flan-t5-large-alpaca model fine-tuned for 2 epochs.
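
The two columns of checkpoint-3000-output.csv described above are enough to recompute an F1-score from the file. The following is a minimal sketch, not the authors' evaluation script: it assumes exact string matching after trimming whitespace, and because the averaging behind the reported 0.8297 is not stated here, it returns both the macro and the micro average. The helper name f1_scores and the sample rows are illustrative only.

```python
import csv
import io
from collections import Counter


def f1_scores(rows, truth_col="label", pred_col="Model generated output"):
    """Return (per-class F1, macro F1, micro F1) for single-label predictions.

    Assumption: a prediction is correct only if it equals the ground-truth
    label exactly after stripping surrounding whitespace.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for row in rows:
        truth = row[truth_col].strip()
        pred = row[pred_col].strip()
        if pred == truth:
            tp[truth] += 1
        else:
            fn[truth] += 1  # the true class was missed
            fp[pred] += 1   # the predicted class was wrong

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    # Macro average runs over the classes that occur in the ground truth.
    classes = sorted(set(tp) | set(fn))
    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in classes}
    macro = sum(per_class.values()) / len(per_class) if per_class else 0.0
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return per_class, macro, micro


# Made-up rows that only mimic the column layout described above.
SAMPLE = (
    "label,Model generated output\n"
    "bug,bug\n"
    "bug,feature\n"
    "feature,feature\n"
    "question,question\n"
)
per_class, macro, micro = f1_scores(csv.DictReader(io.StringIO(SAMPLE)))
print(per_class, macro, micro)
```

To apply this to the real file, open data/checkpoint-3000-output.csv with newline="" and pass csv.DictReader(handle) to f1_scores in place of the in-memory sample.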

Technical info (English)

Environment details:

  • Operating System: Ubuntu 22.04
  • NVIDIA Driver Version: 470.141.03
  • NVIDIA CUDA Version: 12.2.1
  • Python version: 3.10
  • GPU Name: Nvidia A100
  • GPU Memory: 20 GiB
  • CPU Memory: 60 GiB

Note: We also attempted fine-tuning on a V100 GPU, and the results differed slightly, possibly because of differences in GPU architecture. However, running inference on any GPU with the provided finetuned_model_checkpoint-3000 model should yield the same results as reported.
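
One way to check the claim about inference is to compare a freshly generated output file against the provided checkpoint-3000-output.csv. The sketch below is a hypothetical helper, not part of the package: it assumes both files list the test issues in the same order and share the 'Model generated output' column, and the function names are illustrative.

```python
import csv
import io

PRED_COL = "Model generated output"  # column name used in checkpoint-3000-output.csv


def read_predictions(stream, column=PRED_COL):
    """Read the prediction column from an open text stream of CSV data."""
    return [row[column].strip() for row in csv.DictReader(stream)]


def diff_predictions(reference, candidate):
    """Return the row indices at which two prediction lists disagree.

    Assumption: both lists cover the same issues in the same order.
    """
    if len(reference) != len(candidate):
        raise ValueError("prediction files have different numbers of rows")
    return [i for i, (a, b) in enumerate(zip(reference, candidate)) if a != b]


# Made-up data standing in for the provided output and a new inference run.
provided = read_predictions(io.StringIO(f"label,{PRED_COL}\nbug,bug\nfeature,feature\n"))
rerun = read_predictions(io.StringIO(f"label,{PRED_COL}\nbug,bug\nfeature,bug\n"))
print(diff_predictions(provided, rerun))
```

An empty list means the new run reproduced the provided predictions row for row. With real files, pass open(path, newline="", encoding="utf-8") as the stream.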

Files

Files (8.5 GB)

  • md5:81ec72040c9ceedc4fb1a541c3c48b3e (8.5 GB)
  • md5:cfc56d32ec2319fe4b162a234f0b6a7e (2.6 MB)
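
The md5 values listed for the files can be used to confirm that a download is intact. The following is a minimal sketch using only the Python standard library. The function names are illustrative, and the local path passed to matches_listed_md5 is whatever location the archive was saved to.

```python
import hashlib
import io


def md5_of_stream(stream, chunk_size=1 << 20):
    """Return the hex MD5 digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, listed):
    """Compare a local file against a listed value such as 'md5:81ec...'."""
    expected = listed.removeprefix("md5:").strip().lower()
    with open(path, "rb") as handle:
        return md5_of_stream(handle) == expected


# Self-check on in-memory bytes; no archive is needed for this line.
print(md5_of_stream(io.BytesIO(b"hello")))
```

Reading in chunks keeps memory use flat, which matters for the 8.5 GB archive. Note that str.removeprefix requires Python 3.9 or later, which is satisfied by the Python 3.10 environment listed above.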