There is a newer version of the record available.

Published March 23, 2022 | Version 2021-03-23
Dataset Open

A Large-scale Dataset of (Open Source) License Text Variants

  • 1. LTCI, Télécom Paris, Institut Polytechnique de Paris

Description

We introduce a large-scale dataset of the complete texts of free/open source software (FOSS) license variants. To assemble it, we collected from the Software Heritage archive (the largest publicly available archive of FOSS source code with accompanying development history) all versions of files whose names are commonly used to convey licensing terms to software users and developers.
The dataset consists of 6.5 million unique license files that can be used to conduct empirical studies on open source licensing, train automated license classifiers, perform natural language processing (NLP) analyses of legal texts, and carry out historical and phylogenetic studies on FOSS licensing.
Additional metadata about the shipped license files is also provided, making the dataset ready to use in various contexts. The metadata includes file length measures, detected MIME type, detected SPDX license (using ScanCode), an example origin (e.g., a GitHub repository), and the oldest public commit in which the license appeared.
The dataset is released as open data. It comprises an archive file containing all deduplicated license blobs, plus several portable CSV files of metadata that reference the blobs via cryptographic checksums.
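Because the CSV metadata refers to blobs by checksum, a typical first step is to build a lookup from checksum to metadata row. The following is a minimal sketch of that idea, not the dataset's documented layout: the hash algorithm (SHA-1), the column names `sha1` and `mime_type`, and the helper names are all assumptions made for illustration. The included README is the authority on the real file names and columns.

```python
import csv
import hashlib
import io


def blob_checksum(data, algorithm="sha1"):
    """Hex digest of a blob's bytes. SHA-1 is an assumption, not confirmed."""
    return hashlib.new(algorithm, data).hexdigest()


def index_by_checksum(csv_file, key_column="sha1"):
    """Map each checksum to its metadata row. 'sha1' is a hypothetical column name."""
    return {row[key_column]: row for row in csv.DictReader(csv_file)}


# Tiny in-memory stand-in for one metadata CSV file (hypothetical columns).
blob = b"MIT License\n"
metadata_csv = io.StringIO(
    "sha1,mime_type\n" + blob_checksum(blob) + ",text/plain\n"
)

index = index_by_checksum(metadata_csv)
print(index[blob_checksum(blob)]["mime_type"])  # prints "text/plain"
```

With the real files, `metadata_csv` would be an opened CSV file from the dataset, and the key column would be whichever checksum column the README names.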

For more details, see the included README file and the companion paper (DOI: 10.1145/3524842.3528491).

If you use this dataset for research purposes, please acknowledge its use by citing the above paper.

Files

README.md

Files (15.2 GB)

Checksum (MD5)                          Size
md5:07603e18bc2e9c3d881037f26e9ba343    297.4 MB
md5:dcdf42ad8bb0e925d6ad4aa43f06971e    169.3 MB
md5:df57de6ab8373ed17ca3e97c2f8845e3    224.1 MB
md5:043acfc6894f19e6523320ad91401bd2    31.6 MB
md5:e9f1933a3c3e306706bc208b712d7b67    124.8 MB
md5:5218a6ee744039b9ed003689b378928a    14.1 GB
md5:73bf0aa5ffef74c94c667385d8620667    286.5 MB
md5:3a5474e4eb2d39b2b93a704cc613258d    9.1 kB
md5:88299622f1eb221619529b6cfc975851    7.9 kB
md5:7827c39ae812b59bdcc8ac0eb93dd806    6.4 kB
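
The MD5 checksums listed for each file can be used to check that a download completed intact. Here is a minimal sketch using Python's standard `hashlib`. The function names and the file path in the usage note are hypothetical, chosen only for illustration. Files are read in chunks so that the 14.1 GB archive does not need to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed_checksum):
    """Compare a file against a checksum written as 'md5:<hex>' or bare hex."""
    expected = listed_checksum.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For example, `matches_listing("downloaded-file", "md5:5218a6ee744039b9ed003689b378928a")` would return `True` only if the local file (here a placeholder path) has the checksum listed for the 14.1 GB entry.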

Additional details

Related works

Is described by
Conference paper: 10.1145/3524842.3528491 (DOI)