Published June 19, 2023 | Version v1
Dataset Open

PragmaticCode

  • 1. Microsoft Research (India)
  • 2. Microsoft Research (Redmond)

Description

PragmaticCode

Introduction

This repository hosts the official data artifact for the PragmaticCode dataset from the paper "Monitor-Guided Decoding of Code LMs with Static Analysis of Repository Context", appearing at NeurIPS 2023 (titled "Guiding Language Models of Code with Global Context using Monitors" on arXiv). The full code and data artifact, along with detailed instructions, are available in the official repository at https://github.com/microsoft/monitors4codegen. The work introduces Monitor-Guided Decoding (MGD) for code generation with language models, in which a monitor uses static analysis to guide the decoding.

PragmaticCode is a dataset of real-world open-source Java projects, complete with their development environments and dependencies (through their respective build systems). The authors tried to ensure that all the repositories in PragmaticCode were released publicly only after the training-data cutoff date (31 March 2022) determined for the CodeGen, SantaCoder and text-davinci-003 families of models, which were used to evaluate MGD.

The list of repositories, along with their respective licenses and the paths to the zipped repository contents, is available in PragmaticCode/repos.csv. The zipped contents of the full repositories are available under PragmaticCode/github. The contents of the files required for inference for each repository are available in PragmaticCode/fileContentsByRepo.json.
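As a minimal sketch of how the three artifacts named above might be inspected after extracting the archive: the code below assumes the archive unpacks to a local PragmaticCode directory, and it deliberately makes no assumption about the column names in repos.csv or the internal structure of fileContentsByRepo.json, since neither is described here. The function names are hypothetical helpers, not part of the official tooling.

```python
import csv
import json
from pathlib import Path


def load_repo_index(root):
    """Return the rows of repos.csv as dicts keyed by whatever header the file has."""
    with open(Path(root) / "repos.csv", newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def load_file_contents(root):
    """Return the parsed fileContentsByRepo.json object (its structure is not assumed)."""
    with open(Path(root) / "fileContentsByRepo.json", encoding="utf-8") as f:
        return json.load(f)


def list_repo_archives(root):
    """Return the sorted names of the entries found under the github/ directory."""
    return sorted(p.name for p in (Path(root) / "github").iterdir())


if __name__ == "__main__":
    root = Path("PragmaticCode")  # assumed extraction directory
    if root.is_dir():
        rows = load_repo_index(root)
        print("columns:", list(rows[0].keys()) if rows else [])
        print("repositories listed:", len(rows))
        print("archives found:", len(list_repo_archives(root)))
```

Printing the header first is a cautious way to discover the actual column names before writing any code that depends on them.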

DotPrompts

For the evaluation of Language Models of Code, the authors curate a set of 10,000+ examples spanning 1400+ methods from PragmaticCode, such that each example consists of a prompt up to a dereference location (a code location containing the "." operator in Java). DotPrompts can be used to benchmark Language Models of Code on their ability to use repository-level context to generate code for method-level completion tasks. The task for the models is to complete a partially written Java method, using the full repository available from PragmaticCode. Since all the repositories in PragmaticCode are buildable, DotPrompts supports Compilation Rate as an evaluation metric for generated code, in addition to standard ground-truth match metrics such as Next-Identifier Match, Identifier Sequence Match and Prefix Match. Further details on DotPrompts and its usage are available at https://github.com/microsoft/monitors4codegen#dotprompts.
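To make the ground-truth match metrics more concrete, the following is a simplified illustration of one plausible reading of their names. It is not the official evaluation code: the exact definitions and the reference implementation are in the paper and the repository. The regular-expression tokenizer is an assumption that approximates Java identifiers (it also matches keywords), and all function names are hypothetical.

```python
import re

# Assumption: a rough approximation of Java identifiers; keywords are matched too.
IDENT = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")


def identifiers(code):
    """Return identifier-like tokens in order of appearance."""
    return IDENT.findall(code)


def common_prefix_len(a, b):
    """Length of the longest common prefix of two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def next_identifier_match(generated, truth):
    """True if the first identifier generated equals the first in the ground truth."""
    gen, ref = identifiers(generated), identifiers(truth)
    return bool(gen) and bool(ref) and gen[0] == ref[0]


def identifier_sequence_match(generated, truth):
    """Fraction of the ground-truth identifier sequence matched as a prefix."""
    ref = identifiers(truth)
    if not ref:
        return 0.0
    return common_prefix_len(identifiers(generated), ref) / len(ref)


def prefix_match(generated, truth):
    """Fraction of the ground-truth characters matched as a prefix."""
    if not truth:
        return 0.0
    return common_prefix_len(generated, truth) / len(truth)
```

For example, with ground truth `getName().trim();` and generated text `getName().length();`, the first identifier matches, one of the two ground-truth identifiers is matched in sequence, and the first 10 of 17 ground-truth characters agree.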

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines.

Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.

Any use of third-party trademarks or logos is subject to those third parties' policies.

Files

Name: PragmaticCode.zip
Size: 406.1 MB
Checksum: md5:3256aa6fa6bbfdc73a57d2ab8d1ab520
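
The MD5 checksum listed for PragmaticCode.zip can be used to check a download for corruption. The sketch below assumes the archive has been saved locally as PragmaticCode.zip in the working directory; the helper names are hypothetical, and the file is read in chunks so the 406.1 MB archive is never held in memory at once.

```python
import hashlib
from pathlib import Path

# Checksum taken from the file listing for PragmaticCode.zip.
EXPECTED_MD5 = "3256aa6fa6bbfdc73a57d2ab8d1ab520"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("PragmaticCode.zip")  # assumed local download path
    if archive.is_file():
        print("checksum OK" if matches_expected(archive) else "checksum mismatch")
```

A mismatch usually indicates an incomplete or corrupted download, in which case fetching the archive again is the simplest remedy.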

Additional details