Published August 26, 2020 | Version 1.0
Dataset Open

Python Annotated Code Search (PACS) Datasets & Pretrained Models

  • 1. Nokia Bell Labs

Description

This upload contains the datasets and pre-trained models used in the paper "Neural Code Search Revisited: Enhancing Code Snippet Retrieval through Natural Language Intent". Code for loading these datasets and models will be made available at http://github.com/nokia/codesearch

Datasets
There are three types of datasets:

  • snippet collections (code snippets + natural language descriptions): so-ds-feb20, staqc-py-cleaned, conala-curated
  • code search evaluation data (queries linked to relevant snippets of one of the snippet collections): so-ds-feb20-{valid|test}, staqc-py-raw-{valid|test}, conala-curated-0.5-test
  • training data (datasets used to train code retrieval models): so-duplicates-pacs-train, so-python-question-titles-feb20
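To illustrate how evaluation data of this kind (queries linked to relevant snippets) is typically used, here is a minimal sketch that scores ranked retrieval results with mean reciprocal rank and recall@k. The record layout, the field names (`relevant`, `ranked`), and the snippet ids are hypothetical and chosen only for illustration; they are not the file format of this upload, and these are not necessarily the metrics reported in the paper.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/rank of the first relevant snippet, or 0.0 if none was retrieved."""
    for rank, snippet_id in enumerate(ranked_ids, start=1):
        if snippet_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k):
    """Return the fraction of relevant snippets found in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


# Hypothetical toy records: each query has a set of relevant snippet ids and
# the ranking that some retrieval model returned for it.
evaluation = [
    {"query": "read a csv file", "relevant": {"s1"}, "ranked": ["s3", "s1", "s2"]},
    {"query": "sort a list", "relevant": {"s2", "s4"}, "ranked": ["s2", "s1", "s4"]},
]

mrr = sum(reciprocal_rank(e["ranked"], e["relevant"]) for e in evaluation) / len(evaluation)
recall_at_2 = sum(recall_at_k(e["ranked"], e["relevant"], 2) for e in evaluation) / len(evaluation)

print(f"MRR: {mrr:.2f}")
print(f"Recall@2: {recall_at_2:.2f}")
```

With real data, the `ranked` lists would come from a retrieval model run over one of the snippet collections, and the `relevant` sets from the matching evaluation file.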

The staqc-py-cleaned snippet collection and the conala-curated datasets were derived from existing corpora.

The other datasets were mined directly from a recent Stack Overflow dump (https://archive.org/details/stackexchange, LICENSE).

Pre-trained models
Each model can embed queries and (annotated) code snippets in the same space. The models are released under a BSD 3-Clause License.

  • ncs-embedder-so-ds-feb20
  • ncs-embedder-staqc-py
  • tnbow-embedder-so-ds-feb20
  • use-embedder-pacs
  • ensemble-embedder-pacs
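Because queries and snippets are embedded in the same space, retrieval can be reduced to a nearest-neighbour search over vectors. The sketch below shows the idea with cosine similarity on hand-written three-dimensional vectors. The vectors, the snippet ids, and the helper names are assumptions made for illustration; the code does not load or call any of the released models, and cosine similarity is a common choice rather than a confirmed detail of these models.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_snippets(query_vec, snippet_vecs):
    """Return snippet ids ordered from most to least similar to the query."""
    scored = [(cosine(query_vec, vec), snippet_id) for snippet_id, vec in snippet_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [snippet_id for _, snippet_id in scored]


# Hypothetical embeddings; a real embedder would produce these vectors.
snippet_vecs = {
    "read_csv": [0.9, 0.1, 0.0],
    "plot_line": [0.1, 0.9, 0.2],
    "sort_list": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]  # stands in for an embedded natural language query

ranking = rank_snippets(query_vec, snippet_vecs)
print(ranking)
```

For large snippet collections, the per-snippet loop would normally be replaced by a matrix product or an approximate nearest-neighbour index.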

Files

Files (3.0 GB)

Checksum and size of each file:

  • md5:f9a23e7c58bcbf4c896b890a98501cf6, 135.8 kB
  • md5:0d82a1b5713b9e0f963cc28edd951c73, 20.1 kB
  • md5:01d87a5d124d4cc3ae894d388324c851, 392 Bytes
  • md5:d3e03f182aefc6567abf5bb24a882763, 758.2 MB
  • md5:18d95bf20a8a86fd47c8afbad4df68af, 836.6 MB
  • md5:8f3edaffa04c8c4cadfb797297377ae6, 20.0 MB
  • md5:db5d54fb1b7a3cb73f790a8c471b86df, 49.6 kB
  • md5:3cd13b2a9017bd802c873155c5b31ac9, 42.0 kB
  • md5:19be23d9c56c40ba66e9c7deaf9e9b29, 1.8 MB
  • md5:618a9d44d709c7b406e79df2168ae34a, 10.6 MB
  • md5:613a54309f4822d663a9fb97e27065f3, 27.3 MB
  • md5:bd5f77beb928b37f4b0168a3e944d2b6, 30.7 MB
  • md5:9dc608fa6990f005377241fbb4ffa9f6, 105.5 kB
  • md5:06ad578a6c2f5806feded4940c2fdd3f, 85.1 kB
  • md5:e52211856e268be0a0f97942951e3b1b, 764.8 MB
  • md5:7620b07cb1385465dff49efefc162b2b, 549.9 MB
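Each file in the listing is identified by an MD5 checksum, which can be used to confirm that a download completed without corruption. The following is a minimal sketch using Python's standard `hashlib` module. The helper names `md5_of_file` and `matches_listing` are made up for this example, and the demo runs on a small temporary file rather than on one of the files from this upload.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a listed checksum such as 'md5:<hex digest>'."""
    expected = listed[4:] if listed.startswith("md5:") else listed
    return md5_of_file(path) == expected.strip().lower()


# Demo on a temporary file containing the bytes b"hello".
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    demo_path = tmp.name

demo_ok = matches_listing(demo_path, "md5:5d41402abc4b2a76b9719d911017c592")
os.remove(demo_path)
print("checksum matches:", demo_ok)
```

To check a real download, pass its local path together with the corresponding `md5:` value from the listing. MD5 is adequate for detecting accidental corruption but should not be relied on as protection against deliberate tampering.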