Package and Dependency Metadata for CZI Hackathon: Mapping the Impact of Research Software in Science
Description
A collection of useful datasets extracted from https://packages.ecosyste.ms and https://repos.ecosyste.ms for use at the CZI Hackathon: Mapping the Impact of Research Software in Science.
All data is provided as NDJSON (newline-delimited JSON): each line is a valid JSON object, and lines are separated by newline characters. There are Python and R libraries for reading these files, or you can manually read the file line by line and parse each line as a single JSON object.
Each ndjson file has been compressed with gzip (actual command: `tar -czvf`) to reduce download size; the files are significantly larger after extraction.
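As a minimal sketch of the manual approach, the snippet below streams records straight out of a downloaded archive without extracting it to disk first. It assumes the archive is a gzip-compressed tar (as produced by `tar -czvf`) containing one or more NDJSON files; the function name `iter_ndjson_from_tar` and the example path are illustrative, not part of the dataset.

```python
import json
import tarfile
from typing import Iterator


def iter_ndjson_from_tar(archive_path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of each regular file in a .tar.gz."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            stream = archive.extractfile(member)
            if stream is None:
                continue
            for raw_line in stream:
                line = raw_line.strip()
                if line:  # tolerate blank lines between records
                    yield json.loads(line)


# Example usage (hypothetical local path to a downloaded archive):
# for record in iter_ndjson_from_tar("downloaded_archive.tar.gz"):
#     print(sorted(record.keys()))
#     break
```

Reading line by line keeps memory use roughly constant, which matters for the larger files here since they expand well beyond their download size.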
Package Data
Package names from cran, bioconductor and pypi that have been parsed by the software-mentions project (data: https://datadryad.org/stash/dataset/doi:10.5061/dryad.6wwpzgn2c) are collected together with their latest release at time of publishing, along with the names of their dependencies. Those dependency names have then been recursively fetched, each with its latest release and dependencies, until the full list of transitive dependencies is included.
Note: This approach uses a simplified method of dependency resolution, always picking the latest version of each package rather than taking into account each dependency's specific version range requirements. This is primarily due to time constraints, and it allows all software ecosystems to be processed in the same way. A future improvement would be to use each package ecosystem's specific dependency resolution algorithm to compute the full transitive dependency tree for each mentioned software package.
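The simplified resolution described above amounts to a graph traversal over dependency names, ignoring version ranges. The sketch below illustrates the idea only; it is not the code used to build the dataset. The mapping `latest_deps` (package name to the dependency names of its latest release) and the package names in it are hypothetical.

```python
from collections import deque


def transitive_dependencies(package, latest_deps):
    """Return every name reachable from `package`, following latest-release dependencies only."""
    seen = set()
    queue = deque(latest_deps.get(package, []))
    while queue:
        name = queue.popleft()
        if name in seen or name == package:
            continue  # already visited, or a cycle back to the root
        seen.add(name)
        # Version ranges are ignored: only the latest release's dependency names are followed.
        queue.extend(latest_deps.get(name, []))
    return seen


# Hypothetical index: package name -> dependency names of its latest release.
latest_deps = {"a": ["b", "c"], "b": ["c", "d"], "c": [], "d": ["a"]}
print(sorted(transitive_dependencies("a", latest_deps)))  # ['b', 'c', 'd']
```

Because names are tracked in a `seen` set, cycles and shared dependencies are visited once, so the traversal terminates even when packages depend on each other.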
GitHub Data
Two different approaches were taken for collecting data for referenced GitHub mentions:
1. `github.ndjson` is metadata for each repository from GitHub, including "manifest" files: known files that contain dependency information for a project, such as requirements.txt, DESCRIPTION and package.json. These are parsed using https://github.com/ecosyste-ms/bibliothecary and may include transitive dependencies that have been discovered in a `lockfile` within the repository.
2. `github_packages.ndjson` is metadata for each package found on any package manager that references the GitHub URL as its repository URL, source or homepage. These packages, like the cran and pypi data above, include the latest release and their direct dependencies. There may be more than one package for each GitHub URL, as it is a one-to-many relationship. `github_packages_with_transitive.ndjson` follows the same format but also includes the resolved transitive dependencies of all packages, using the same approach as the cran and pypi data above, with the same caveats.
These files also reference many more ecosystems than just cran, bioconductor and pypi. https://packages.ecosyste.ms provides a standardized metadata format for all of them, which makes comparison and automation simpler.
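Since one GitHub URL can map to several packages across ecosystems, it can help to group package records by repository before analysis. The sketch below is one way to do this; the field names `repository_url`, `ecosystem` and `name` are assumptions about the record layout and should be checked against the actual files, and the sample records are made up.

```python
from collections import defaultdict


def group_by_repository(packages, url_key="repository_url"):
    """Map each normalized repository URL to the (ecosystem, name) pairs that reference it."""
    grouped = defaultdict(list)
    for pkg in packages:
        url = pkg.get(url_key)  # assumed field name; adjust to the real schema
        if not url:
            continue
        # Normalize trailing slashes and case so equivalent URLs group together.
        key = url.rstrip("/").lower()
        grouped[key].append((pkg.get("ecosystem"), pkg.get("name")))
    return dict(grouped)


# Made-up records for illustration only.
sample = [
    {"repository_url": "https://github.com/example/tool", "ecosystem": "pypi", "name": "tool"},
    {"repository_url": "https://github.com/Example/tool/", "ecosystem": "cran", "name": "tool"},
    {"repository_url": None, "ecosystem": "pypi", "name": "orphan"},
]
print(group_by_repository(sample))
```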
Contact
If you would like any help, support or more data from Ecosyste.ms please do get in touch via email: hello@ecosyste.ms or open an issue on GitHub: https://github.com/ecosyste-ms/packages/issues
Files
(2.5 GB)
| MD5 checksum | Size |
|---|---|
| md5:6fffedfa2af4148313ff647d11a17dfa | 823.0 kB |
| md5:b9987720978b574f2571e6931cc5b5b1 | 1.0 MB |
| md5:fcf1049accdd3cfc532af92a9d153243 | 30.4 MB |
| md5:44452f16fb47972804a4e34a500cd5ae | 50.5 MB |
| md5:46026c56a49758bc0bd6b1c73f179dc6 | 5.4 MB |
| md5:1dbd7ce391e5f1ddc45cf55ccb7192ea | 5.9 MB |
| md5:77c44077a2de451c53671ad8390d872b | 622.9 MB |
| md5:36a92a8969eb9eb9a9726dbddcdde5f3 | 731.9 MB |
| md5:f7594f53f49d927f719671b2da3ec4fd | 1.0 GB |
| md5:27ffcfb1a81a3215dade6291bc375fe5 | 30.8 MB |
| md5:4ca5a9a3dcc4839a62ec3cfaa4844630 | 31.2 MB |
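The md5 values listed for each file can be used to check that a download completed intact. The following is a minimal sketch using only the Python standard library; the helper names and the example path are hypothetical, and the expected value should be copied from the listing for the file you downloaded.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, listed):
    """Compare a file against a listed value such as 'md5:6fff...' (the prefix is optional)."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected


# Example usage (hypothetical local path):
# print(matches_listed_md5("downloaded_archive.tar.gz", "md5:6fffedfa2af4148313ff647d11a17dfa"))
```

Chunked reading avoids loading the larger archives into memory at once.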
Additional details
References
- Istrate, Ana-Maria et al. (2022). CZ Software Mentions: A large dataset of software mentions in the biomedical literature [Dataset]. Dryad. https://doi.org/10.5061/dryad.6wwpzgn2c