Text-to-text Generation for Issue Report Classification

Rejithkumar, Gokul; Rose Anish, Preethu; Ghaisas, Smita

doi:10.5281/zenodo.10298950

Published December 9, 2023 | Version v1

Computational notebook Open

1. TCS Research

Submission for the NLBSE Issue Report Tool Competition

This package accompanies the submission titled "Text-to-text Generation for Issue Report Classification" to the NLBSE Issue Report Tool Competition. The package provides resources for replicating the experiments and results presented.

Description of ZIP Files:

issue_classification_t5: This archive contains the code for replicating the study, including the retrieval of the pre-trained model, fine-tuning procedures, and inference execution.
1. code: Contains all the code files.
  1. finetuning.py: Code for fine-tuning the VMware/flan-t5-large-alpaca model on the issue report classification task. Instructions for running the fine-tuning process are embedded as comments within the file; read them before running the script.
  2. inference.py: Code for running inference with the fine-tuned model. As with the fine-tuning script, instructions for running inference are embedded as comments within the file.
  3. download_plm.py: Code for downloading the pre-trained VMware/flan-t5-large-alpaca model from https://huggingface.co/VMware/flan-t5-large-alpaca
  4. requirements.txt: This file enumerates the required Python modules and their respective versions necessary for the successful execution of the provided code.
2. data: Contains the NLBSE issue report classification data and the model output produced by running inference.py on issue-report-test.csv.
  1. checkpoint-3000-output.csv: Predictions on issue-report-test.csv from the VMware/flan-t5-large-alpaca model after fine-tuning for 2 epochs on issue-report-train.csv (F1-score of 0.8297). Column 'label' contains the ground-truth labels. Column 'Model generated output' contains the labels predicted by the model.
  2. issue-report-train.csv: NLBSE24 issue report tool competition training dataset. (Source: https://github.com/nlbse2024/issue-report-classification)
  3. issue-report-test.csv: NLBSE24 issue report tool competition test dataset. (Source: https://github.com/nlbse2024/issue-report-classification)
finetuned_model_checkpoint-3000: This archive contains the VMware/flan-t5-large-alpaca model after fine-tuning for 2 epochs.
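
As an illustration of how the two columns of checkpoint-3000-output.csv can be compared, the following is a minimal sketch and not the scoring script used for the submission. The column names 'label' and 'Model generated output' come from the description above; the helper names, the lowercasing and whitespace stripping, and the choice of a macro average are assumptions. This record does not state which averaging produced the reported F1-score, so the value computed here may differ from 0.8297.

```python
import csv
from collections import Counter


def per_label_f1(rows, gold_col="label", pred_col="Model generated output"):
    """Return {label: F1} for every label that occurs in the ground truth."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for row in rows:
        # Assumption: labels are compared case-insensitively, ignoring padding.
        gold = row[gold_col].strip().lower()
        pred = row[pred_col].strip().lower()
        if gold == pred:
            tp[gold] += 1
        else:
            fn[gold] += 1  # the true label was missed
            fp[pred] += 1  # the predicted label was wrong
    scores = {}
    for label in sorted(set(tp) | set(fn)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-label F1 values (an assumed averaging)."""
    return sum(scores.values()) / len(scores) if scores else 0.0


def score_file(path):
    """Read an output CSV and return its per-label F1 scores."""
    with open(path, newline="", encoding="utf-8") as handle:
        return per_label_f1(csv.DictReader(handle))
```

For example, `macro_f1(score_file("data/checkpoint-3000-output.csv"))` would score the provided output file, assuming the archive is extracted so that the file sits at that relative path.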

Technical info (English)

Environment details:

Operating System: Ubuntu 22.04
NVIDIA Driver Version: 470.141.03
NVIDIA CUDA Version: 12.2.1
Python version: 3.10
GPU Name: NVIDIA A100
GPU Memory: 20 GiB
CPU Memory: 60 GiB

Note: We also attempted fine-tuning using a V100 GPU, and the results showed slight differences, potentially attributed to variations in GPU architecture. However, running inference on any GPU using the provided model finetuned_model_checkpoint-3000 should yield the same results as reported.
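
The following is a hedged sketch of how the provided checkpoint might be loaded and queried with the Hugging Face transformers library; inference.py in the package remains the authoritative script. The prompt template, the function names, the token limits, and the use of greedy decoding (`do_sample=False`) are assumptions made for illustration. The sketch also assumes the extracted checkpoint directory contains the tokenizer files; if it does not, the tokenizer may need to be loaded from VMware/flan-t5-large-alpaca instead.

```python
def build_prompt(title, body):
    # Hypothetical prompt template; the real one is defined in inference.py.
    return (
        "Classify the following issue report.\n"
        f"Title: {title}\nBody: {body}\nLabel:"
    )


def load_checkpoint(checkpoint_dir):
    # Imported lazily so build_prompt can be used without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint_dir)
    return tokenizer, model


def classify(tokenizer, model, title, body, max_input_tokens=512):
    """Generate a label string for one issue report."""
    inputs = tokenizer(
        build_prompt(title, body),
        return_tensors="pt",
        truncation=True,
        max_length=max_input_tokens,
    )
    inputs = inputs.to(model.device)
    # Greedy decoding (assumed) keeps the generated label deterministic.
    output_ids = model.generate(**inputs, max_new_tokens=8, do_sample=False)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True).strip()
```

A call such as `tokenizer, model = load_checkpoint("finetuned_model_checkpoint-3000")` followed by `classify(tokenizer, model, title, body)` would then return the generated label text, where the directory name is the assumed location of the extracted archive.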

Files

Files (8.5 GB)

Name	MD5	Size
finetuned_model_checkpoint-3000.zip	81ec72040c9ceedc4fb1a541c3c48b3e	8.5 GB
issue_classification_t5.zip	cfc56d32ec2319fe4b162a234f0b6a7e	2.6 MB
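
Because the model archive is large, it can be worth checking a download against the MD5 digests listed for the two files. A minimal sketch using Python's standard hashlib module follows; the digests are copied from this record, while the function names and the chunk size are illustrative choices.

```python
import hashlib

# Digests as listed for the two archives in this record.
EXPECTED_MD5 = {
    "finetuned_model_checkpoint-3000.zip": "81ec72040c9ceedc4fb1a541c3c48b3e",
    "issue_classification_t5.zip": "cfc56d32ec2319fe4b162a234f0b6a7e",
}


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected):
    """Return True when the file's MD5 digest matches the expected value."""
    return md5_of(path) == expected.lower()
```

Reading in chunks keeps memory use flat for the 8.5 GB archive. A check might look like `verify(name, EXPECTED_MD5[name])` for each downloaded file name, assuming the files are in the current directory.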

