There is a newer version of the record available.

Published June 9, 2023 | Version v2
Dataset Open

Natural Language Inference Dataset for Software Engineering

Authors/Creators

Description

This repository introduces a specialized NLI dataset designed to improve the performance of language models on NLP tasks in the software engineering domain. We collected texts from the software engineering domain and manually curated the entailment relationships between sentences. The sentence pairs are drawn from the PROMISE dataset, the PURE dataset, user guides for various software products, articles on operating systems (e.g., official documentation for Windows and Mac), databases (e.g., official documentation for MongoDB and Oracle), cybersecurity (MITRE documentation), and software product descriptions, including AWS documentation. The resulting textual entailment dataset contains 10k sentence pairs, manually labeled for balanced classification with the labels Entailment, Contradiction, and Neutral.

Files

Files (375.6 MB)

Name: TrainNLI.txt
Size: 375.6 MB
Checksum: md5:f71e2eaeaa60c9a5340410543e3fe503
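
Because the record lists an MD5 checksum for TrainNLI.txt, a downloaded copy can be checked for corruption before use. The following is a minimal sketch, assuming the file was saved as TrainNLI.txt in the current working directory; the helper names md5_of_file and matches_record are hypothetical and not part of the dataset.

```python
import hashlib
from pathlib import Path

# MD5 value listed on the record for TrainNLI.txt.
EXPECTED_MD5 = "f71e2eaeaa60c9a5340410543e3fe503"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks.

    Chunked reading keeps memory use flat for a file of several hundred MB.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    local_copy = Path("TrainNLI.txt")  # assumed download location
    if local_copy.exists():
        print("checksum OK" if matches_record(local_copy) else "checksum mismatch")
    else:
        print(f"{local_copy} not found")
```

A mismatch usually points to an incomplete download or to a different version of the record, so the version of the record is worth confirming before re-downloading.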

Additional details

Related works

Is cited by
Dataset: 10.5281/zenodo.8020689 (DOI)