GitHub Copilot apparently violating open source licensing, says programmers

Kip Kniskern

Microsoft acquired GitHub back in 2018 for $7.5 billion, and since then has been integrating the code repository into its developer tools while maintaining a largely hands-off approach. However, writer, lawyer, and programmer Matthew Butterick has some issues with Microsoft’s machine-learning-based code assistant, GitHub Copilot, and the way it is apparently mishandling open-source licenses.

GitHub Copilot works by offering “suggestions” for code completion as you type, and is a plugin available for Visual Studio and other IDEs. The AI-based system is powered by Codex. But it’s the way the AI is trained, or more precisely what it’s trained on, that is becoming a problem for developers like Butterick.

According to OpenAI, the developers of Codex (which is licensed by Microsoft):

Codex was trained on “tens of millions of public repositories” including code on GitHub. Microsoft itself has vaguely described the training material as “billions of lines of public code”. But Copilot researcher Eddie Aftandilian confirmed in a recent podcast (@ 36:40) that Copilot is “train[ed] on public repos on GitHub”.

Episode 533: Eddie Aftandilian on GitHub Copilot

The problem here is that the public repos Copilot is trained on are licensed, and require attribution when code from the repositories is used. Microsoft has been vague about its use of the code, calling it fair use, but Copilot can not only offer suggestions but also emit verbatim chunks of code, as shown by Texas A&M professor and GitHub user Tim Davis.

For programmers like Butterick, who contribute open source code out of a sense of community, stripping any attribution away from their work is a problem:

Arguably, Microsoft is creating a new walled garden that will inhibit programmers from discovering traditional open-source communities. Or at the very least, remove any incentive to do so. Over time, this process will starve these communities. User attention and engagement will be shifted into the walled garden of Copilot and away from the open-source projects themselves—away from their source repos, their issue trackers, their mailing lists, their discussion boards. This shift in energy will be a painful, permanent loss to open source.

You can check out Butterick’s “GitHub Copilot investigation” for more information.