Published October 26, 2023 | Version 1.0.6
Dataset Open

Package and Dependency Metadata for CZI Hackathon: Mapping the Impact of Research Software in Science

  • 1. Ecosyste.ms

Description

A collection of useful datasets extracted from https://packages.ecosyste.ms and https://repos.ecosyste.ms for use at the CZI Hackathon: Mapping the Impact of Research Software in Science.

All data is provided as NDJSON (newline-delimited JSON): each line is a valid JSON object, and lines are separated by newline characters. There are Python and R libraries for reading these files, or you can manually read the file line by line and parse each line as a single JSON object.
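Reading a file manually, as described above, might look like the following minimal Python sketch. It uses only the standard library; the helper name `iter_ndjson` and the sample records are illustrative and not part of the dataset.

```python
import io
import json
from typing import Iterator, TextIO


def iter_ndjson(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of an NDJSON stream."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"invalid JSON on line {line_number}") from exc


# Made-up records standing in for a real file; the field names are illustrative.
sample = io.StringIO('{"name": "pkg-a"}\n{"name": "pkg-b"}\n')
records = list(iter_ndjson(sample))
```

Because the function is a generator, it processes one line at a time, which matters for the larger files in this dataset that would be costly to load into memory in one go.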

Each NDJSON file has been compressed with gzip (actual command: `tar -czvf`, which produces a gzip-compressed tar archive) to reduce download size. The files expand to significantly larger sizes after extraction.
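Since `tar -czvf` produces a gzip-compressed tar archive, one option is to stream records straight out of the archive without extracting it to disk first. The sketch below assumes each archive contains one or more NDJSON files; the function name and the in-memory demo archive are hypothetical and only there to make the example self-contained.

```python
import io
import json
import tarfile
from typing import BinaryIO, Iterator


def iter_records_from_tar_gz(fileobj: BinaryIO) -> Iterator[dict]:
    """Stream JSON objects from every regular file inside a .tar.gz archive."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            extracted = archive.extractfile(member)
            if extracted is None:
                continue
            for raw_line in extracted:
                if raw_line.strip():
                    yield json.loads(raw_line)


def _build_demo_archive() -> io.BytesIO:
    """Build a tiny in-memory archive that mimics the layout assumed above."""
    payload = b'{"name": "pkg-a"}\n{"name": "pkg-b"}\n'
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        info = tarfile.TarInfo(name="demo.ndjson")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


demo_records = list(iter_records_from_tar_gz(_build_demo_archive()))
```

For a downloaded archive, the same function can be given a file opened in binary mode, for example `open(path, "rb")`, where `path` is wherever you saved the download.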

Package Data

Package names from cran, bioconductor and pypi that have been parsed by the software-mentions project (data: https://datadryad.org/stash/dataset/doi:10.5061/dryad.6wwpzgn2c) are collected together with their latest release at time of publishing, along with the names of their dependencies. Those dependency names have then also been recursively fetched, each with its latest release and dependencies, until the full list of transitive dependencies is included.

Note: This approach uses a simplified method of dependency resolution, always picking the latest version of each package rather than taking into account each dependency's specific version range requirements. This is primarily due to time constraints, and it allows all software ecosystems to be processed in the same way. A future improvement would be to use each package ecosystem's specific dependency resolution algorithm to compute the full transitive dependency tree for each mentioned software package.
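The simplified resolution described above can be pictured as a breadth-first walk over dependency names that ignores version ranges. The following is an illustrative sketch only, not the code used to build the dataset: `latest_deps` is a hypothetical mapping from a package name to the dependency names of its latest release, and the toy index is made up.

```python
from collections import deque
from typing import Dict, List, Set


def transitive_dependencies(root: str, latest_deps: Dict[str, List[str]]) -> Set[str]:
    """Collect every package reachable from `root`, ignoring version ranges."""
    seen: Set[str] = set()
    queue = deque(latest_deps.get(root, []))
    while queue:
        name = queue.popleft()
        if name in seen or name == root:
            continue  # already visited, or a cycle back to the root
        seen.add(name)
        # Unknown packages contribute no further dependencies.
        queue.extend(latest_deps.get(name, []))
    return seen


# Toy index with a shared dependency and a cycle back to the root.
toy_index = {
    "app": ["lib-a", "lib-b"],
    "lib-a": ["lib-c"],
    "lib-b": ["lib-c", "app"],
    "lib-c": [],
}
closure = transitive_dependencies("app", toy_index)
```

Tracking visited names keeps the walk finite even when packages depend on each other in a cycle. A resolver that honoured version ranges could arrive at a different set, which is the caveat the note above describes.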

GitHub Data

Two different approaches were taken for collecting data for referenced GitHub mentions:

1. `github.ndjson` is metadata for each repository from GitHub, including "manifest" files. These are known files that contain dependency information for a project, such as requirements.txt, DESCRIPTION and package.json, and they are parsed using https://github.com/ecosyste-ms/bibliothecary. The parsed data may include transitive dependencies that have been discovered in a `lockfile` within the repository.

2. `github_packages.ndjson` is metadata for each package found on any package manager that references the GitHub URL as its repository URL, source or homepage. These packages, like the cran and pypi data above, include the latest release and their direct dependencies. There may be more than one package for each GitHub URL, as it is a one-to-many relationship. `github_packages_with_transitive.ndjson` follows the same format but also includes the extra resolved transitive dependencies of all packages, using the same approach as with the cran and pypi data above, with the same caveats.
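Because one GitHub URL can map to several packages, it can help to group package records by repository before analysis. The sketch below assumes each record stores its repository link under a `repository_url` key. That field name is an assumption here and should be checked against the actual files; the sample records are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_packages_by_repository(
    packages: Iterable[dict], url_field: str = "repository_url"
) -> Dict[str, List[dict]]:
    """Group package records by a normalized repository URL."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for package in packages:
        url = package.get(url_field)
        if not url:
            continue  # records without a repository link are skipped
        # Lowercase and strip the trailing slash so trivially different
        # spellings of the same URL land in the same group.
        grouped[url.rstrip("/").lower()].append(package)
    return dict(grouped)


# Made-up records; the names, ecosystems and URLs are illustrative only.
sample_packages = [
    {"name": "example-py", "ecosystem": "pypi",
     "repository_url": "https://github.com/Example/project"},
    {"name": "example-r", "ecosystem": "cran",
     "repository_url": "https://github.com/example/project/"},
    {"name": "other", "ecosystem": "pypi",
     "repository_url": "https://github.com/example/other"},
    {"name": "no-repo", "ecosystem": "pypi"},
]
by_repo = group_packages_by_repository(sample_packages)
```

The normalization is a design choice for this sketch; the data may already be consistent enough to group on the raw value.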

There are also many more ecosystems referenced in these files than just cran, bioconductor and pypi. https://packages.ecosyste.ms provides a standardized metadata format for all of them, which makes comparison possible and simplifies automation.

Contact

If you would like any help, support or more data from Ecosyste.ms please do get in touch via email: hello@ecosyste.ms or open an issue on GitHub: https://github.com/ecosyste-ms/packages/issues

Files (2.5 GB)

Checksum Size
md5:6fffedfa2af4148313ff647d11a17dfa 823.0 kB
md5:b9987720978b574f2571e6931cc5b5b1 1.0 MB
md5:fcf1049accdd3cfc532af92a9d153243 30.4 MB
md5:44452f16fb47972804a4e34a500cd5ae 50.5 MB
md5:46026c56a49758bc0bd6b1c73f179dc6 5.4 MB
md5:1dbd7ce391e5f1ddc45cf55ccb7192ea 5.9 MB
md5:77c44077a2de451c53671ad8390d872b 622.9 MB
md5:36a92a8969eb9eb9a9726dbddcdde5f3 731.9 MB
md5:f7594f53f49d927f719671b2da3ec4fd 1.0 GB
md5:27ffcfb1a81a3215dade6291bc375fe5 30.8 MB
md5:4ca5a9a3dcc4839a62ec3cfaa4844630 31.2 MB
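
The MD5 checksums listed above can be used to confirm that a download completed intact. A minimal sketch using Python's standard `hashlib` follows; the helper names are hypothetical, and the demo hashes an in-memory byte string rather than one of the real files.

```python
import hashlib
import io
from typing import BinaryIO


def md5_hex(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a binary stream in fixed-size chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(stream: BinaryIO, listed: str) -> bool:
    """Compare a stream's digest with a listed value such as 'md5:<hex>'."""
    expected = listed[4:] if listed.startswith("md5:") else listed
    return md5_hex(stream) == expected.strip().lower()


# Demo on in-memory bytes; for a real download, pass open(path, "rb") instead.
demo_ok = matches_listed_checksum(
    io.BytesIO(b"hello"), "md5:5d41402abc4b2a76b9719d911017c592"
)
```

Reading in chunks keeps memory use flat for the larger archives. MD5 is adequate here for detecting corrupted or truncated downloads, though it is not suitable as a defence against deliberate tampering.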
