Following a temporary ban in Italy and a spate of inquiries in other EU nations, OpenAI has just over a week to comply with European data protection regulations. If it fails, it may be fined, forced to destroy data, or even banned.
However, experts have told MIT Technology Review that it will be all but impossible for OpenAI to comply. That is because of how the data used to train its AI models was collected: by scraping content off the internet.
The prevailing wisdom in AI development is that the more training data, the better. The data set for OpenAI's GPT-2 model was 40 gigabytes of text. GPT-3, on which ChatGPT is based, was trained on 570 GB of data. OpenAI has not shared how big the data set for its latest model, GPT-4, is.
However, the company's appetite for ever-larger models is now coming back to haunt it. In recent weeks, several Western data protection authorities have opened inquiries into how OpenAI collects and processes the data that powers ChatGPT. They suspect it scraped people's personal information, such as names and email addresses, and used it without their consent.
As a precaution, the Italian authorities have restricted the use of ChatGPT, while data regulators in France, Germany, Ireland, and Canada are all investigating how the OpenAI system collects and uses data. The European Data Protection Board, the umbrella organization for Europe's data protection authorities, is also setting up an EU-wide task force to coordinate investigations and enforcement around ChatGPT.
Italy's data protection authority has given OpenAI until April 30 to comply with the law. That would mean OpenAI would have to ask people for consent before scraping their data, or demonstrate that it has a "legitimate interest" in collecting it. OpenAI will also have to explain to people how ChatGPT uses their data and give them the ability to correct any errors the chatbot produces about them, to have their data erased if they wish, and to object to the program's use of it.
If OpenAI cannot convince the authorities that its data-use practices are legal, it could be banned in individual countries or even across the entire European Union. It could also face hefty fines and might be forced to delete its models and the data used to train them, says Alexis Leautier, an AI expert at the French data protection agency CNIL.
A high-stakes game
The stakes could not be higher for OpenAI. The EU's General Data Protection Regulation is the world's strictest data protection regime, and it has been copied widely elsewhere. Regulators everywhere from Brazil to California will be paying close attention to what happens next, and the outcome could fundamentally change the way AI companies collect data.
In addition to being more transparent about its data practices, OpenAI will have to show that it collected the training data for its algorithms on one of two legal bases: consent or "legitimate interest."
It seems unlikely that OpenAI will be able to argue that it obtained people's consent when it scraped their data. That leaves the argument that it had a "legitimate interest" in doing so. This will likely require the company to make a convincing case to regulators about how essential ChatGPT really is in order to justify collecting data without consent, says Lilian Edwards, an internet law professor at Newcastle University.
OpenAI has told MIT Technology Review that it believes it complies with privacy laws, and that it strives to remove personal information from its training data upon request "where feasible." The company says its models are trained on publicly available content, licensed content, and content created by human reviewers. But for the GDPR, that's too low a bar.
“The US has a doctrine that when stuff is in public, it's no longer private, which is not at all how European law works,” says Edwards. The GDPR gives people rights as “data subjects,” such as the right to be informed about how their data is collected and used and to have their data removed from systems, even if it was public in the first place.
Looking for a needle in a haystack
OpenAI faces another problem. The Italian regulator says OpenAI is not being transparent about how it collects users' data during the post-training phase, such as in the chat logs of their interactions with ChatGPT. And according to Margaret Mitchell, an AI researcher and chief ethics scientist at the startup Hugging Face, who was formerly Google's AI ethics co-lead, it will be nearly impossible for OpenAI to identify individuals' data and remove it from its models.
The company could have saved itself a giant headache by building in robust data record-keeping from the start, she says. Instead, it is common in the AI industry to build data sets by indiscriminately scraping the web and then outsourcing the work of removing duplicate or irrelevant data points, filtering out unwanted content, and fixing errors. These methods, and the sheer size of the data sets, mean tech companies tend to have a very limited understanding of what has gone into training their models.
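To make that concrete, here is a minimal, hypothetical sketch of such a post-hoc cleanup pass in Python. The hash-based deduplication and keyword blocklist are illustrative stand-ins, not OpenAI's actual pipeline; note that nothing in it records where each document came from or whom it mentions.

```python
import hashlib
import re

# Toy filter standing in for real content-quality rules (assumption, not a real pipeline).
BLOCKLIST = re.compile(r"lorem ipsum|click here to subscribe", re.IGNORECASE)

def clean_corpus(documents):
    """Post-hoc cleanup of an indiscriminately scraped corpus:
    drop exact duplicates, then filter out unwanted records.
    No provenance (source URL, named individuals) is kept."""
    seen_hashes = set()
    cleaned = []
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen_hashes:   # exact duplicate: skip
            continue
        seen_hashes.add(digest)
        if BLOCKLIST.search(doc):   # unwanted content: skip
            continue
        cleaned.append(doc)
    return cleaned
```

Because provenance is discarded at this stage, honoring a later request to delete one person's data means searching the entire corpus after the fact.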
Finding Italian users' data in ChatGPT's vast training data set will be like looking for a needle in a haystack. And even if OpenAI does manage to erase users' data, it is unclear whether that step is permanent. Studies have shown that data sets linger on the internet long after they have been deleted, because copies of the original remain online.
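The sketch below illustrates why that search is so hard, assuming a hypothetical data subject and a corpus stored as plain text. Even this exact-match scan misses misspellings, paraphrases, and indirect references, and it says nothing about what a model trained on the data has already memorized.

```python
import re

# Hypothetical identifiers for a single data subject (illustrative only).
SUBJECT_PATTERNS = [
    re.compile(r"\bMario\s+Rossi\b"),          # a name
    re.compile(r"mario\.rossi@example\.com"),  # an email address
]

def find_subject_mentions(corpus_lines):
    """Yield (line number, line) for every exact match of the subject's
    identifiers. This scans only the raw data set; it cannot reveal
    what a trained model has absorbed."""
    for line_no, line in enumerate(corpus_lines):
        if any(p.search(line) for p in SUBJECT_PATTERNS):
            yield line_no, line

# Usage: stream a large corpus file without loading it into memory.
# with open("corpus.txt", encoding="utf-8") as f:
#     for line_no, line in find_subject_mentions(f):
#         print(line_no, line.strip())
```

Even a perfect scan only fixes the data set; scrubbing a person's influence from an already-trained model would mean retraining from scratch or using machine-unlearning techniques that remain experimental.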
“The state of the art around data collection is very, very immature,” says Mitchell. That’s because tons of work has gone into developing cutting-edge techniques for AI models, while data collection methods have barely changed in the past decade.
In the AI community, work on AI models is overemphasized at the expense of everything else, says Nithya Sambasivan, an AI researcher who has studied data practices in machine learning. “Culturally, there’s this issue in machine learning where working on data is seen as silly work and working on models is seen as real work,” Mitchell agrees.