# Survey of reproducibility in _hep-lat_ submissions in 2021

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.6584001.svg)](https://doi.org/10.5281/zenodo.6584001)

This repository makes available the results of a survey of all
submissions and cross-lists to the hep-lat category of the [arXiv][arxiv] in 2021.

## Methodology

### Obtaining data

Papers were downloaded in PDF format
using the arXiv bulk downloader, and filtered
with a listing of papers from the [arXiv API][arxiv-api].

Information on which papers were conference talks was obtained
via the [Inspire API][inspire-api].

Both of these steps are documented in the file `data_acquisition.ipynb`.

Submissions with an identifier starting `21` (i.e. those that first appeared
on the arXiv between 1 January and 31 December 2021) and categorised in
hep-lat (either as the primary category or as a cross-list) are included.
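
The selection rule above can be expressed as a simple check on the identifier and the category list. The following is a minimal sketch, not the code in `data_acquisition.ipynb`; the name `include_submission` and the regular expression are assumptions made for illustration, and it assumes new-style identifiers of the form `YYMM.NNNNN` with an optional version suffix.

```python
import re

# Assumed shape of a new-style arXiv identifier: YYMM.NNNNN, optional "vN" suffix.
ARXIV_ID = re.compile(r"^(?P<yy>\d{2})(?P<mm>\d{2})\.\d{4,5}(?:v\d+)?$")


def include_submission(arxiv_id, categories):
    """Return True for a 2021 identifier that lists hep-lat among its categories."""
    match = ARXIV_ID.match(arxiv_id.strip())
    if match is None:
        return False  # old-style or malformed identifier
    starts_with_21 = match["yy"] == "21"
    valid_month = 1 <= int(match["mm"]) <= 12
    # hep-lat may be the primary category or a cross-list
    return starts_with_21 and valid_month and "hep-lat" in categories
```

Here `categories` is taken to be the full list of categories for the submission, so a cross-listed paper passes as long as `hep-lat` appears anywhere in it.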

### Surveying data

Each paper was skim-read at a very high level, and search tools
(including [pdfgrep][pdfgrep]) were used to search for relevant
keywords.

The availability of any data and workflows cited was verified. Beyond
this, all information was taken from what was reported in the
submissions (and, in a few cases, in articles explicitly cited).
No effort was made to seek out tools or data that were mentioned
but not cited, even though these may be publicly available.

Data were input into a Microsoft Excel spreadsheet.
Some features of Microsoft Excel were used to ensure consistency,
including conditional formatting and data validation.

## Data structure

The survey results are in the file `survey_2021.csv`. This is in
comma-separated format, with the columns as documented below.

### Fields

* `arXiv ID`: The arXiv identifier of the submission.
* `Primary category`: The primary arXiv category of the submission.
* `Journal`: The journal in which the submission was published, if known. (From the [Inspire API][inspire-api].)
* `Is proceedings`: Whether or not the work is from conference proceedings. Values are `Y`/`N`.
* `Presents new numerical results`: Whether or not the work includes any new numerical results. This includes a table containing numbers with error bars, or a plot with numbers on the axes, where these are not directly quoted from other work. Values are `Y`/`N`, with a `?` indicating that this is ambiguous.
* `UK authors`: Whether or not at least one author has at least one affiliation to an institution in the United Kingdom. Values are `Y`/`N`.
* `Generates field configurations`: Whether the work presented involved generating new field configurations by some Monte Carlo-adjacent algorithm. Values are `Y`/`N`, with a `?` indicating that this is ambiguous.
* `Specifies software used for configuration generation`: Whether the work specifies what software application or applications were used for generating configurations. Values are as in "Citation formats" below.
* `Software used for configuration generation`: A comma-separated list of any software mentioned being used for generating field configurations.
* `Repository/hosting service for configuration generation code`: A comma-separated list of any data repositories or hosting services used for any software mentioned in the previous column. `Personal website` indicates any web page controlled by an individual, including personal home pages on institutional web servers.
* `Performs measurements`: Whether the work involves computing observables from gauge configurations. Values are `Y`/`N`, with a `?` indicating that this is ambiguous.
* `Specifies software used for measurement`: Where work performed measurements, whether the work specifies what software application or applications were used for this. Values are as in "Citation formats" below.
* `Software used for measurements`: A comma-separated list of any software mentioned being used for performing measurements.
* `Repository/hosting service for measurement code`: A comma-separated list of any data repositories or hosting services used for any software mentioned in the previous column. `Personal website` indicates any web page controlled by an individual, including personal home pages on institutional web servers.
* `Uses existing configurations`: Whether the work makes use of field configurations generated as part of earlier work, either by the authors or by others. Values are `Y`/`N`, with a `?` indicating that this is ambiguous.
* `Configuration hosting infrastructure acknowledged`: If the work acknowledges the infrastructure used to host and distribute field configurations used (e.g. ILDG or a Regional Grid), a comma-separated list of the services acknowledged.
* `Cites existing configurations`: Whether the work provides a citation for any existing configurations used. Values are as in "Citation formats" below.
* `Configurations generated by`: A comma-separated list of collaborations whose field configurations were used.
* `Reanalyses other existing data`: Whether non-field configuration data from other work is incorporated into an analysis. Values are `Y`/`N`, with a `?` indicating that this is ambiguous.
* `Cites other existing data`: Where the work reanalyses other existing data, whether and how that data is cited. Values are as in "Citation formats" below.
* `Publishes data`: Whether the work makes the data generated available in machine-readable format (i.e. beyond plots and numbers in tables in the PDF). Values are `Y`/`N`, with a `?` indicating that this is ambiguous.
* `Data available on request?`: Whether the work does not make data available, but claims that data would be available if requested. Values are `Y`/`N`, with a `?` indicating that this is ambiguous.
* `Repository used for data`: A comma-separated list of any data repositories or hosting services used for any data released. `Personal website` indicates any web page controlled by an individual, including personal home pages on institutional web servers.
* `Specifies software used for analysis`: Where the work generated numerical results, whether and how the work specifies any software used for analysing these results. This is any software used outside of generating configurations and performing measurements that led to the presented results. Values are as in "Citation formats" below.
* `Software used for analysis`: A comma-separated list of any software mentioned being used for data analysis.
* `Repository/hosting service for analysis code`: A comma-separated list of any data repositories or hosting services used for any software mentioned in the previous column. `Personal website` indicates any web page controlled by an individual, including personal home pages on institutional web servers.
* `Publishes parts of analysis`: Whether any aspect of the bespoke analysis workflow for this work was made available. Values are `Y`/`N`, with a `?` indicating that this is ambiguous.
* `Publishes full analysis`: Whether software that will reproduce the full analysis was made available. Values are `Y`/`N`, with a `?` indicating that this is ambiguous.
* `Acknowledges an HPC centre`: Whether computational time on shared computing facilities was acknowledged. Values are `Y`/`N`, with a `?` indicating that this is ambiguous.
* `Acknowledges Supercomputing Wales`: Whether the specific HPC service [Supercomputing Wales][scw] was acknowledged.
* `Acknowledges DiRAC`: Whether the specific HPC service [DiRAC][dirac] was acknowledged.
* `Review paper`: Whether the paper is a review or review-like (i.e. re-presents results previously presented elsewhere with attribution). Values are `Y`/`N`, with a `?` indicating that this is ambiguous.
* `Cites any other software`: Whether any other software not directly leading to the presented results is attributed. Values are as in "Citation formats" below.
* `Comments`: Any other comments not fitting into the above categories.
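
As a rough illustration of how these fields might be consumed, the sketch below reads the CSV with Python's standard `csv` module, tallies a `Y`/`N`/`?` column, and splits a comma-separated list cell into its entries. The helper names `load_rows`, `tally`, and `split_list` are hypothetical, the sample rows are invented and not taken from `survey_2021.csv`, and it is assumed that list cells are quoted in the usual CSV way.

```python
import csv
import io
from collections import Counter


def load_rows(csv_file):
    """Read the survey CSV into a list of dicts keyed by column name."""
    return list(csv.DictReader(csv_file))


def tally(rows, column):
    """Count each value (e.g. Y, N, ?) appearing in a single-valued column."""
    return Counter(row[column].strip() for row in rows)


def split_list(cell):
    """Split a comma-separated list cell into its entries, dropping blanks."""
    return [item.strip() for item in cell.split(",") if item.strip()]


# Invented rows for illustration only; the identifiers are deliberately not valid.
sample = io.StringIO(
    "arXiv ID,Publishes data,Software used for analysis\n"
    "2100.00001,Y,\"Python, gnuplot\"\n"
    "2100.00002,N,\n"
    "2100.00003,?,Python\n"
)
rows = load_rows(sample)
publishes = tally(rows, "Publishes data")
software = Counter(
    name
    for row in rows
    for name in split_list(row["Software used for analysis"])
)
```

For the real data, `load_rows` would be given the opened file instead of the in-memory sample, for example inside `with open("survey_2021.csv", newline="") as f:`.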

### Citation formats

* `No`: No information is given
* `Mentioned by name`: a piece of software, data, or collaboration is mentioned but no other information is provided
* `Data repository citation`: a citation is made to a data repository such as [Zenodo][zenodo]
* `Paper citation`: a citation is made to a paper, with no other information on how to access the data or software
* `URL citation`: a URL is included in the references section/bibliography
* `Included`: code is included as part of the publication
* `Footnote/inline URL`: a URL to the resource is included either inline in the text or in a footnote, but not in the references section/bibliography
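
A check along the following lines could be used to confirm that a citation-format column contains only the values listed above. This is a hedged sketch: the name `unexpected_citation_values` is hypothetical, and it assumes that each cell holds a single format and that blank cells mean the field does not apply.

```python
# The seven values documented in the list above.
CITATION_FORMATS = frozenset({
    "No",
    "Mentioned by name",
    "Data repository citation",
    "Paper citation",
    "URL citation",
    "Included",
    "Footnote/inline URL",
})


def unexpected_citation_values(values):
    """Return, sorted, any non-empty values that are not a documented format."""
    seen = {value.strip() for value in values if value.strip()}
    return sorted(seen - CITATION_FORMATS)
```

An empty list returned for a column such as `Cites existing configurations` would indicate that every non-blank entry matches one of the documented formats exactly, including capitalisation.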

## Analysis

An initial summary analysis of the data is presented in the Jupyter notebook
`analysis.ipynb`. Results from this analysis were presented
[at the UKLFT Annual Meeting in Liverpool in May 2022][uklft-talk] and
[at the 39th International Symposium on Lattice Field Theory (LATTICE 2022)][lattice-talk].

## Version history

* 1.1.2: Fix error in previous release; minor plot tidying
* 1.1.1: Allow plotting style to be switched for Lattice 2022 proceedings
* 1.1.0: Updated analysis notebook as presented at Lattice 2022, removing split between UK and non-UK. Field "Lattice data grid acknowledged" renamed to "Configuration hosting infrastructure acknowledged" to avoid confusion with the specific ILDG Regional Grid called the "Lattice Data Grid".
* 1.0.1: Minor update to cropping in analysis notebook
* 1.0.0: Initial release.

[arxiv]: https://arxiv.org
[arxiv-api]: https://arxiv.org/help/api/
[dirac]: https://dirac.ac.uk
[inspire-api]: https://github.com/inspirehep/rest-api-doc
[lattice-talk]: https://edbennett.github.io/lattice2022-survey-talk
[pdfgrep]: https://pdfgrep.org
[scw]: https://www.supercomputing.wales
[uklft-talk]: https://edbennett.github.io/uklft-talk-20220527
[zenodo]: https://zenodo.org
