Natural Language Inference Dataset for Software Engineering

For RE Conference

doi:10.5281/zenodo.8025035

There is a newer version of the record available.

Published June 9, 2023 | Version v2

Dataset Open

This repository introduces a specialized NLI dataset designed to improve the performance of language models on NLP tasks in the software engineering domain. We collected texts from the software engineering domain and manually curated the entailment relationships between sentences. The sentence pairs are drawn from the PROMISE dataset, the PURE dataset, user guides for various software products, articles on operating systems (e.g., official documentation of Windows and Mac), databases (e.g., official documentation of MongoDB and Oracle), cyber security (MITRE documentation), and software product descriptions, including AWS documentation. The resulting textual entailment dataset contains 10k sentence pairs, manually labeled for balanced classification with the labels Entailment, Contradiction, and Neutral.

Files

Files (375.6 MB)

Name	MD5	Size
TrainNLI.txt	f71e2eaeaa60c9a5340410543e3fe503	375.6 MB

Additional details

Is cited by: Dataset: 10.5281/zenodo.8020689 (DOI)

Resource type: Dataset

Publisher: Zenodo

License: Creative Commons Attribution 4.0 International

The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited.

Technical metadata

Created: June 11, 2023
Modified: June 12, 2023
