Citation and Research Objects: Toward Active Research Objects

Daniel S. Katz
NCSA, CS, ECE, & iSchool
University of Illinois at Urbana-Champaign
Urbana, IL, USA
d.katz@ieee.org

Abstract This extended abstract, submitted to the Workshop on Research Objects 2019 (RO2019), held in conjunction with the 15th International Conference on eScience (eScience 2019), discusses the state of citation for software and for data, and explains the challenges in citing more complex research objects, including Research Objects. It proposes how current Research Objects and their contents can be cited, and additionally proposes Active Research Objects to make this easier.

Index Terms software citation, data citation, Research Objects

I. Introduction

In the last five or six years, the research community has made good progress in creating principles for citing simple research objects, such as data [1] and software [2]. We have arrived at different principles for these two types of simple objects because they are fundamentally different; for example, they are treated differently by copyright law and have different types of appropriate licenses [3]. Additionally, in the recent past, a number of community efforts have arisen to implement these citation principles (e.g., the FORCE11 Data Citation Implementation Group and the FORCE11 Software Citation Implementation Working Group), for data most recently in the context of the FAIR data principles [4] (e.g., Enabling FAIR Data). Note that there is no widely accepted equivalent of the FAIR data principles for software or for other research objects, though some researchers are working in this area.

Currently, citing these simple research objects (data and software) is possible, and is relatively straightforward, though not yet common practice, particularly for software. Doing so follows the example of the long-established method for citing papers:

  1. The item and associated metadata are deposited in a repository.
  2. The repository stores/archives the item and metadata, and provides an identifier that can be used to retrieve them.
  3. The identifier and metadata are used to cite the object.

However, the situation for complex research objects (objects that contain other objects) is not as clear. While the entire object can be cited as a single object, in many cases the content objects also have their own identifiers, particularly if they existed independently before they were added to the research object. It is beneficial to be able to track references to these content objects, whether they are independent or are contained in a complex research object, ideally using as few identifiers as possible.

In this extended abstract, I propose a simple means for citation of existing complex research objects and their contents, then an extension of those objects to better support their citation.

II. Citing Research Objects and their contents

My first proposal is to treat the current complex research object as a container and a set of contents, and to cite the object itself and all the objects it contains that were used.

Recent work in the FORCE11 Software Citation Implementation Working Group has led to the definition of a set of challenges [5]. One of these challenges is how to cite complex software objects, namely frameworks that include components. A framework can have dozens or thousands of components, only some of which are used in a particular instance, so a set of citations for that instance should include a citation of the framework and citations of the components that were used.

Citation of Research Objects (ROs) [6] leads to a similar challenge: while the RO itself should be cited, it also makes sense to cite the objects in the RO that are used, and to omit those that are not. Citations of the objects in the RO can then be handled in the same way as citations of those objects outside an RO, whether they are data, software, or something else. However, this relies on the objects being separable, which is not the case for some complex research objects, e.g., Jupyter Notebooks, where all the software, data, and text are bundled in such a way that they cannot be separated and individually cited.

The necessary steps are thus:

  1. Tracking which parts of the RO were used (both the RO itself and the objects within it)
  2. Finding identifiers and other citation metadata for the RO and the objects within it that were used
  3. Building correctly formatted citations for the RO and the objects within it that were used

Step 1 is the greatest challenge. With current Research Objects, this must be done outside the RO, either manually or by tools that use the RO (e.g., an electronic notebook system). For Steps 2 and 3, I suggest that the RO itself be treated as a data object, and that it should follow the data citation principles. Citing the software, data, and documentation objects in an RO is then the same as it would be for independent software and data objects and for papers. As for identifiers for the contents, they may either already have identifiers based on their existence outside the RO, or they can be given identifiers when the RO is given an identifier, with suitable relationship metadata between the RO and the content.
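As an illustration only, the three steps could be sketched as an external tracker of the kind a tool using the RO might keep. Everything named here (Item, UsageTracker, record_use, the placeholder identifiers and citation strings) is hypothetical and is not part of any existing Research Object specification or library. The sketch also assumes that the RO itself is always cited and that citation text is already available as metadata.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass(frozen=True)
class Item:
    """Citation metadata for the RO or for one object inside it (hypothetical)."""
    identifier: str  # e.g., a DOI; placeholder values are used below
    citation: str    # already-formatted citation text


@dataclass
class UsageTracker:
    """Hypothetical tracker kept outside the RO, e.g., by a notebook system."""
    ro: Item
    contents: Dict[str, Item]
    used: Set[str] = field(default_factory=set)

    def record_use(self, name: str) -> Item:
        # Step 1: note that this content object was used.
        item = self.contents[name]
        self.used.add(name)
        return item

    def citations(self) -> List[str]:
        # Steps 2 and 3: look up metadata and emit citations for the RO
        # (assumed here to be always cited) and for the used contents only.
        return [self.ro.citation] + [
            self.contents[name].citation for name in sorted(self.used)
        ]


# Usage with made-up metadata.
tracker = UsageTracker(
    ro=Item("10.0000/example-ro", "A. Author, Example RO, 2019."),
    contents={
        "data.csv": Item("10.0000/example-data", "B. Author, Example data, 2018."),
        "model.py": Item("10.0000/example-code", "C. Author, Example code, 2017."),
    },
)
tracker.record_use("data.csv")
for line in tracker.citations():
    print(line)
```

In this sketch, model.py is never recorded as used, so only the RO and data.csv appear in the resulting list.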

III. Active Research Objects and Citation

My second proposal is to expand the notion of Research Objects to include a more object-oriented approach, called Active Research Objects (AROs), where we add internal data and methods to the RO.

ARO methods would include put() and get() to place an object in the ARO and to access it. In addition to the data currently required by many ROs (description, checksum, etc.), put() would require further data beyond the object being placed, for example, an external identifier (e.g., a DOI) and a citation, and perhaps also an internal identifier (e.g., an IDO [7]). To provide fixity, a validate() method may also exist.

ARO data would include a set of flags for each internal object, initially set to false when the object is put and then set to true if the object is accessed (via the ARO’s get() method).
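A minimal sketch of how put(), get(), validate(), and the per-object flags might fit together is given below. It is an assumption-laden illustration rather than a proposed interface: the class and field names, the in-memory storage, the use of SHA-256 for the checksum, and the was_used() helper are all hypothetical choices made for the example.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Entry:
    """One internal object plus the metadata supplied to put() (hypothetical)."""
    payload: bytes
    external_id: str            # e.g., a DOI
    citation: str
    description: str
    checksum: str               # SHA-256 hex digest (an assumed choice)
    internal_id: Optional[str] = None
    used: bool = False          # flag: false on put(), true after get()


class ActiveResearchObject:
    """Illustrative in-memory ARO; not an existing library or standard."""

    def __init__(self) -> None:
        self._entries: Dict[str, Entry] = {}

    def put(self, name: str, payload: bytes, external_id: str, citation: str,
            description: str = "", internal_id: Optional[str] = None) -> None:
        # Record the checksum at deposit time so fixity can be checked later.
        checksum = hashlib.sha256(payload).hexdigest()
        self._entries[name] = Entry(payload, external_id, citation,
                                    description, checksum, internal_id)

    def get(self, name: str) -> bytes:
        # Accessing an object through the ARO marks it as used.
        entry = self._entries[name]
        entry.used = True
        return entry.payload

    def validate(self, name: str) -> bool:
        # Fixity check: does the stored payload still match its checksum?
        entry = self._entries[name]
        return hashlib.sha256(entry.payload).hexdigest() == entry.checksum

    def was_used(self, name: str) -> bool:
        return self._entries[name].used


# Usage with made-up metadata.
aro = ActiveResearchObject()
aro.put("data.csv", b"x,y\n1,2\n", external_id="10.0000/example-data",
        citation="B. Author, Example data, 2018.")
print(aro.was_used("data.csv"))
aro.get("data.csv")
print(aro.was_used("data.csv"), aro.validate("data.csv"))
```

Because the flag is set inside get(), tracking of what was used happens as a side effect of normal access rather than in a separate tool outside the object.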

Another ARO method would be citation(), similar to the citation method in R [8], except that it could be used to obtain the citation for the RO as a whole, the citations for the RO and any internal objects that have been used, or the citation for one specific internal object.
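The three modes of citation() could be sketched as follows. The signature (an optional name plus a used_only switch), the class name CitableARO, and the simplified put() and get() are assumptions made for this example only; they do not correspond to the R function or to any existing Research Object tooling.

```python
from typing import Dict, List, Optional


class CitableARO:
    """Illustrative ARO that stores citations and usage flags (hypothetical)."""

    def __init__(self, ro_citation: str) -> None:
        self._ro_citation = ro_citation
        self._payloads: Dict[str, bytes] = {}
        self._citations: Dict[str, str] = {}
        self._used: Dict[str, bool] = {}

    def put(self, name: str, payload: bytes, citation: str) -> None:
        self._payloads[name] = payload
        self._citations[name] = citation
        self._used[name] = False  # flag starts as false

    def get(self, name: str) -> bytes:
        self._used[name] = True   # flag set on access
        return self._payloads[name]

    def citation(self, name: Optional[str] = None,
                 used_only: bool = False) -> List[str]:
        # Mode 3: the citation for one specific internal object.
        if name is not None:
            return [self._citations[name]]
        # Mode 2: the RO plus every internal object that has been used.
        if used_only:
            return [self._ro_citation] + [
                cite for obj, cite in self._citations.items() if self._used[obj]
            ]
        # Mode 1: the RO as a whole.
        return [self._ro_citation]


# Usage with made-up citations.
aro = CitableARO("A. Author, Example RO, 2019.")
aro.put("data.csv", b"x,y\n1,2\n", "B. Author, Example data, 2018.")
aro.put("model.py", b"print('hi')\n", "C. Author, Example code, 2017.")
aro.get("data.csv")
print(aro.citation())
print(aro.citation(used_only=True))
print(aro.citation(name="model.py"))
```

Returning a list in every mode keeps the call sites uniform; a real design might instead return structured metadata and leave formatting to the caller.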

References

[1]    Data Citation Synthesis Group, “Joint declaration of data citation principles,” M. Martone, Ed. San Diego CA: FORCE11, 2014. [Online]. Available: https://doi.org/10.25490/a97f-egyk

[2]    A. M. Smith, D. S. Katz, K. E. Niemeyer, and FORCE11 Software Citation Working Group, “Software citation principles,” PeerJ Computer Science, vol. 2, p. e86, Sep. 2016. [Online]. Available: https://doi.org/10.7717/peerj-cs.86

[3]    D. S. Katz, K. E. Niemeyer, A. M. Smith, W. L. Anderson, C. Boettiger, K. Hinsen, R. Hooft, M. Hucka, A. Lee, F. Löffler, T. Pollard, and F. Rios, “Software vs. data in the context of citation,” PeerJ Preprints, vol. 4, p. e2630v1, Dec. 2016. [Online]. Available: https://doi.org/10.7287/peerj.preprints.2630v1

[4]    M. D. Wilkinson, M. Dumontier, I. J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, J.-W. Boiten, L. B. da Silva Santos, P. E. Bourne, J. Bouwman, A. J. Brookes, T. Clark, M. Crosas, I. Dillo, O. Dumon, S. Edmunds, C. T. Evelo, R. Finkers, A. Gonzalez-Beltran, A. J. G. Gray, P. Groth, C. Goble, J. S. Grethe, J. Heringa, P. A. C. ’t Hoen, R. Hooft, T. Kuhn, R. Kok, J. Kok, S. J. Lusher, M. E. Martone, A. Mons, A. L. Packer, B. Persson, P. Rocca-Serra, M. Roos, R. van Schaik, S.-A. Sansone, E. Schultes, T. Sengstag, T. Slater, G. Strawn, M. A. Swertz, M. Thompson, J. van der Lei, E. van Mulligen, J. Velterop, A. Waagmeester, P. Wittenburg, K. Wolstencroft, J. Zhao, and B. Mons, “The FAIR guiding principles for scientific data management and stewardship,” Scientific Data, vol. 3, p. 160018, Mar. 2016. [Online]. Available: https://doi.org/10.1038/sdata.2016.18

[5]    D. S. Katz, D. Bouquin, N. P. C. Hong, J. Hausman, C. Jones, D. Chivvis, T. Clark, M. Crosas, S. Druskat, M. Fenner, T. Gillespie, A. González Beltrán, M. Gruenpeter, T. Habermann, R. Haines, M. Harrison, E. A. Henneken, L. J. Hwang, M. B. Jones, A. A. Kelly, D. N. Kennedy, K. Leinweber, F. Rios, C. B. Robinson, I. Todorov, M. Wu, and Q. Zhang, “Software citation implementation challenges,” arXiv, vol. 1905.08674, 2019. [Online]. Available: http://arxiv.org/abs/1905.08674

[6]    S. Bechhofer, I. Buchan, D. De Roure, P. Missier, J. Ainsworth, J. Bhagat, P. Couch, D. Cruickshank, M. Delderfield, I. Dunlop, M. Gamble, D. Michaelides, S. Owen, D. Newman, S. Sufi, and C. Goble, “Why linked data is not enough for scientists,” Future Generation Computer Systems, vol. 29, no. 2, pp. 599–611, Feb. 2013. [Online]. Available: https://doi.org/10.1016/j.future.2011.08.004

[7]    R. Di Cosmo, M. Gruenpeter, and S. Zacchiroli, “Identifiers for Digital Objects: the Case of Software Source Code Preservation,” in iPRES 2018 - 15th International Conference on Digital Preservation, Boston, United States, Sep. 2018, pp. 1–9. [Online]. Available: https://hal.archives-ouvertes.fr/hal-01865790

[8]    K. Hornik, “Citing R,” in R FAQ: Frequently Asked Questions on R, 2018. [Online]. Available: https://cran.r-project.org/doc/FAQ/R-FAQ.html#Citing-R