Forget me not: memorisation in generative sequence models trained on open source licensed code
- 1. KU Leuven Centre for IT & IP Law - imec
Description
Generative sequence models, like GPT-3/4, Stable Diffusion and DALL·E, are increasingly utilised to produce artifacts traditionally associated with human ingenuity, such as text, images, audio, videos and code. Despite their impressive ability to generalise on unseen data, these models are prone to memorising fragments of their training data. In some extreme cases, these ‘memories’ may contain verbatim and potentially infringing reproductions of works protected by copyright. In this paper, we focus on one specific example, namely program source code.
The ongoing litigation against Microsoft’s GitHub Copilot service shows that these concerns are far from theoretical. GitHub Copilot is a commercial service designed to support software development workflows. It generates code based on a program specification that a programmer provides in natural language. The service relies on the generative model Codex, which has been trained on public open source code repositories hosted on GitHub and fine-tuned for code generation. In the words of its creators, this model has been trained on ‘billions of lines of public code’, that is, computer programs arguably covered by copyright law and distributed under an open source licence.
Copilot has been shown capable of reproducing, on occasion, verbatim fragments of what is allegedly its training dataset without appropriate attribution or notice. These reproductions have included not only functional code, but also original, expressive code plausibly protected by copyright. While open source software is, by definition, distributed with its source, many open source licences follow a direct licensing model where attribution, notice and licence notice preservation requirements must be observed to avoid downstream recipients being found in breach of the licence.
This controversy has sparked heated debates in both deep learning and legal communities as to the legality of developing and using such models under copyright law. In this paper, we explore the implications of memorisation for copyright infringement under EU law and propose a set of solutions that may help alleviate these concerns.
Files
Name | Size |
---|---|
GenAI_Memorisation_Open_Source_IEmanuilov_TMargoni-Preprint_Zenodo.pdf (md5:f66e649b17c458b89b448aad5843cd9e) | 1.7 MB |
Additional details
Dates
- Created: 2024-02-07