GitHub Copilot Is Using Your Code for AI Training

The video warns GitHub users about a new default setting that allows GitHub Copilot to use their code and interactions, including those from private repositories, for AI training unless they explicitly opt out by April 24th, which raises significant privacy concerns. It also suggests self-hosted Git platforms such as Gitea and Forgejo as alternatives for keeping control over code privacy, and urges viewers to review their settings and share the information widely.

The video serves as a public service announcement urging GitHub users to disable a new default setting that allows GitHub Copilot to use their code and interactions for AI training. The creator received an email from GitHub stating that, starting April 24th, users' inputs, outputs, code snippets, and related context from Copilot usage will be used to train and improve AI models unless they opt out. In practice, this means GitHub will by default collect and use data generated whenever a user interacts with Copilot, in public and private repositories alike, which raises privacy concerns.

The video explains how to opt out of this data collection by navigating to GitHub account settings, then to Copilot settings, and disabling the option that allows GitHub to use data for AI model training. The creator emphasizes the importance of checking this setting even if users believe they have already opted out, as the default is now to participate unless explicitly disabled. The video also references an official GitHub blog post that details the data usage policy and clarifies what types of data are collected, including code context, comments, file names, and user feedback on suggestions.

A significant concern highlighted is the ambiguity around what constitutes private data, especially when using Copilot in private repositories. Although GitHub states that private repositories at rest are not used for training, the interaction data generated while using Copilot in these repositories may still be collected and used. This blurs the line between private and public data, potentially exposing sensitive code to AI training without explicit consent, which the creator finds troubling and misleading for users who expect their private repositories to remain confidential.

In response to these privacy issues, the video suggests alternatives to GitHub for hosting code, such as the self-hosted Git platforms Gitea and Forgejo. These platforms let users keep full control over their repositories without the risk of their data being used for AI training. The creator recommends Gitea for its ease of use and polished interface, especially for those with home labs or personal servers, and mentions Forgejo as another viable option that can run on minimal hardware such as a Raspberry Pi.
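The video does not walk through a migration, but as a rough illustration, moving an existing repository to a self-hosted instance can be sketched with a mirror clone followed by a mirror push. The Python helper below only assembles the two git commands; the function name, the URLs, and the working directory are hypothetical placeholders, and the sketch assumes an empty repository has already been created on the self-hosted server.

```python
import shlex
from typing import List


def mirror_commands(source_url: str, target_url: str,
                    workdir: str = "repo-mirror.git") -> List[List[str]]:
    """Build the git commands that copy all refs from source_url to target_url.

    Hypothetical helper for illustration only. It assumes target_url points
    at an empty repository that already exists on the self-hosted server.
    """
    return [
        # Bare clone that includes every branch, tag, and other ref.
        ["git", "clone", "--mirror", source_url, workdir],
        # Push all of those refs to the new remote.
        ["git", "-C", workdir, "push", "--mirror", target_url],
    ]


if __name__ == "__main__":
    # Placeholder URLs; substitute your own repository and server.
    commands = mirror_commands(
        "https://github.com/example-user/example-repo.git",
        "https://git.example.home/example-user/example-repo.git",
    )
    for command in commands:
        # Print the commands for review instead of executing them.
        print(shlex.join(command))
```

A mirror push carries over Git history only. Issues, pull requests, and other platform data are not part of the Git repository and would need a separate import step.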

The video concludes by expressing frustration with Microsoft’s and GitHub’s data policies, which seem increasingly focused on harvesting user data under the guise of improving AI tools. The creator encourages viewers to share this information widely before the April 24th deadline and invites feedback on the policy changes. Finally, the video promotes additional Linux learning resources available on the creator’s website for those interested in expanding their technical knowledge.