Private API Keys and Passwords Discovered in a Popular AI Training dataset

 

The Common Crawl dataset, which is used to train several artificial intelligence models, contains nearly 12,000 valid secrets, including API keys and passwords. The Common Crawl non-profit organisation maintains a vast open-source archive of petabytes of web data collected since 2008, which is free to use.

Because the archive is so large, many artificial intelligence projects, including those from OpenAI, DeepSeek, Google, Meta, Anthropic, and Stability, may rely on it, at least in part, to train large language models (LLMs).

Truffle Security researchers discovered the secrets after scanning 400 terabytes of data from 2.67 billion web pages in the Common Crawl December 2024 archive. They uncovered 11,908 secrets that authenticated successfully, all of them hardcoded by developers, which highlights that LLMs could be trained on insecure code.

It should be noted that LLM training data is not used in its raw form; it is first cleaned and filtered to remove unwanted content such as useless, duplicate, malicious, or sensitive data. Despite these efforts, removing confidential data is difficult, and the process does not guarantee that all personally identifiable information (PII), financial data, medical records, and other sensitive content will be stripped from such a large dataset.

After examining the scanned data, Truffle Security found valid API keys for WalkScore, MailChimp, and Amazon Web Services (AWS). In total, TruffleHog, the company's open-source secret scanner, identified 219 distinct secret types in the Common Crawl dataset, with MailChimp API keys being the most prevalent.
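To give a rough idea of how pattern-based secret detection works, here is a minimal Python sketch. It is not TruffleHog's implementation: the two regular expressions are simplified assumptions about what MailChimp API keys and AWS access key IDs commonly look like, and the function names are hypothetical. It also only finds candidates, whereas the researchers additionally confirmed that each secret authenticated successfully.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Simplified, illustrative patterns. These are assumptions about common key
# shapes, not the detectors TruffleHog actually uses.
PATTERNS = {
    # 32 hex characters followed by a data-centre suffix such as "-us6"
    "mailchimp_api_key": re.compile(r"\b[0-9a-f]{32}-us[0-9]{1,2}\b"),
    # "AKIA" followed by 16 uppercase letters or digits
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}


def find_candidate_secrets(text: str) -> List[Tuple[str, str]]:
    """Return (secret_type, matched_string) pairs found in one page."""
    hits = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((kind, match.group(0)))
    return hits


def count_by_type(pages: Iterable[str]) -> Counter:
    """Count candidate secrets per type across many pages."""
    counts = Counter()
    for page in pages:
        for kind, _value in find_candidate_secrets(page):
            counts[kind] += 1
    return counts
```

Running `count_by_type` over a list of page bodies gives a per-type tally, similar in spirit to the breakdown by secret type described above. A match is only a candidate: without a verification step, a scan like this will also flag expired or fake keys.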

The researchers explain that developers made the mistake of hardcoding these keys into HTML forms and JavaScript snippets rather than using server-side environment variables. An attacker could exploit the keys for malicious purposes such as phishing and brand impersonation, and leaked secrets could also lead to data exfiltration. Another notable finding in the report is the high reuse rate of the uncovered secrets, with 63% appearing on more than one page.
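The server-side alternative mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: the environment variable name `MAILCHIMP_API_KEY`, the `/subscribe` endpoint, and both function names are hypothetical and not taken from the report. The point is that the key is read from the server's environment and never written into the markup sent to the browser.

```python
import os


def load_api_key(env_var: str = "MAILCHIMP_API_KEY") -> str:
    """Read the key from the server's environment (hypothetical variable name)."""
    key = os.environ.get(env_var)
    if not key:
        # Fail loudly instead of falling back to a hardcoded value.
        raise RuntimeError(f"{env_var} is not set")
    return key


def render_signup_form(action_url: str = "/subscribe") -> str:
    """Build the HTML sent to the browser; it contains no secret.

    The form posts to a server-side endpoint (hypothetical URL), and only
    that endpoint calls load_api_key() when it talks to the mail provider.
    """
    return (
        f'<form method="post" action="{action_url}">'
        '<input type="email" name="email" required>'
        '<button type="submit">Subscribe</button>'
        "</form>"
    )
```

With this split, a crawler that archives the page sees only the form, so there is nothing secret to end up in a dataset like Common Crawl.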

In one extreme case, a single WalkScore API key "appeared 57,029 times across 1,871 subdomains." The researchers also discovered a web page with 17 unique live Slack webhooks, which should be kept private because they allow apps to post messages to Slack. After conducting the research, Truffle Security contacted the affected vendors and worked with them to revoke the keys belonging to their users.

The researchers say they helped those organisations collectively rotate or revoke several thousand keys. Truffle Security's findings are a warning that insecure coding mistakes can affect an LLM's behaviour, even if an AI model is trained on older archives than the one the researchers analysed.