GitHub Copilot, one of several recent tools that generate code suggestions with the help of AI models, remains a problem for some users because of licensing concerns and the telemetry the software sends back to the Microsoft-owned company.
So Brendan Dolan-Gavitt, assistant professor in the Department of Computer Science and Engineering at NYU Tandon, has released FauxPilot, an alternative to Copilot that runs locally, without phoning home to parent company Microsoft.
Copilot is based on OpenAI Codex, a GPT-3-based natural-language-to-code system that has been trained on “billions of lines of public code” in GitHub repositories. This has made Free and Open Source Software (FOSS) advocates uneasy because Microsoft and GitHub have not specified exactly which repositories were used to train Codex.
As Bradley Kuhn, Policy Fellow at the Software Freedom Conservancy (SFC), wrote in a blog post earlier this year, “Copilot leaves copyleft compliance as an exercise for the user. Users will likely face growing liability that only increases as Copilot improves. Users currently have no methods besides chance and educated guesswork to know whether Copilot’s output is copyrighted by someone else.”
Shortly after GitHub made Copilot commercially available, the SFC urged open source maintainers not to use GitHub in part because of its refusal to address concerns about Copilot.
Not a perfect world
FauxPilot does not use Codex. It relies on Salesforce’s CodeGen model. However, free and open source software advocates are unlikely to be satisfied, because CodeGen was also trained on public open source code without regard to the nuances of the different licenses.
Dolan-Gavitt explained in a phone interview with The Register: “So there are still some issues, probably with licensing, that won’t be resolved by this.
“On the other hand, if someone with enough computational power comes along and says, ‘I’m going to train a model that’s only trained on GPL code or code with a license that allows me to reuse it without attribution’ or something like that, they can train their model, drop that model into FauxPilot, and use that model instead.”
For Dolan-Gavitt, the primary goal of FauxPilot is to provide a way to run AI assistance software locally.
“There are people who have privacy concerns, or perhaps, in the case of business, company policies that prevent them from sending their code to a third party, and being able to run it locally certainly helps with that,” he explained.
GitHub, in its description of the data collected by Copilot, describes an option to disable the collection of code snippets, which include “source code you’re editing, related files and other files open in the same IDE or editor, URLs of repositories and file paths.”
But doing so does not appear to stop the collection of user engagement data: “user edit actions such as accepted and rejected completions, and general error and usage data to determine metrics such as response time and feature engagement” and possibly “personal data, such as pseudonymous identifiers.”
Dolan-Gavitt said he sees FauxPilot as a research platform.
“One thing we want to do is train code models that hopefully will produce more secure code,” he explained. “Once we do that we’re going to want to be able to test them, and maybe even test them with actual users, using something like Copilot but with our own models. So that was kind of an incentive.”
Doing so, however, presents some challenges. “Right now, it’s a little impractical to try to build a dataset that doesn’t have any vulnerabilities because the models are really data-hungry,” Dolan-Gavitt said.
“So they want lots and lots of code to train on. But we don’t have very good or foolproof ways to ensure the code is bug-free. So it would be a huge amount of work to try to curate a dataset that was free of vulnerabilities.”
However, Dolan-Gavitt, who co-authored a paper on the insecurity of Copilot code suggestions, found the AI assistance helpful enough to stick with it.
“My personal feeling about this is that I’ve basically been running Copilot since it was introduced last summer,” he explained. “I find it really useful. That said, I do kind of have to double-check its work. But it’s often easier for me to at least start with something that it gives me and then tweak it until it’s correct rather than trying to build it from scratch.” ®