Microsoft has addressed a data leak that stemmed from AI researchers inadvertently sharing open-source training data on GitHub. The company moved swiftly to close a vulnerability that exposed 38TB of private data from its AI research division.
The breach was uncovered by ethical hackers from cloud security firm Wiz, who on June 22, 2023 identified a shareable link that relied on an Azure Shared Access Signature (SAS) token. Wiz promptly reported its findings to the Microsoft Security Response Center, which invalidated the SAS token by June 24. On July 7, the token on the original GitHub page was replaced.
The exploit revolved around SAS tokens, an Azure file-sharing feature that, when mishandled, can leave systems vulnerable. Wiz first detected the vulnerability while scanning the internet for improperly configured storage containers, a known entry point for breaches of cloud-hosted data.
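To make the risk concrete, here is a minimal sketch, using the azure-storage-blob Python SDK with a hypothetical account, container and blob, of why a leaked SAS URL matters: the token embedded in the query string is itself the credential, so anyone who obtains the link can read the data without any further authentication.

```python
from azure.storage.blob import BlobClient

# A SAS URL pairs the blob's address with a signed token in the query string.
# The account, blob and token values below are placeholders for illustration.
sas_url = (
    "https://exampleaccount.blob.core.windows.net/models/weights.ckpt"
    "?sv=2022-11-02&sp=r&se=<expiry>&sig=<signature>"
)

# No account key or Azure AD sign-in is required: the link alone is the credential.
blob = BlobClient.from_blob_url(sas_url)
data = blob.download_blob().readall()
print(f"Downloaded {len(data)} bytes")
```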
Their investigation led them to 'robust-models-transfer', a repository housing open-source code and AI models used for image recognition within Microsoft's AI research division.
The root of the problem was a SAS token associated with an internal storage account. A Microsoft employee, while working on open-source AI learning models, inadvertently published to a public GitHub repository a URL for a Blob store (a form of object storage in Azure) containing an AI dataset. Because the token in that URL was scoped far more broadly than the dataset it was meant to share, the Wiz team was able to use it to access the entire storage account.
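As an illustration of that class of misconfiguration, the sketch below uses the azure-storage-blob SDK to mint an account-level SAS; the names and key are placeholders, not the actual token from the incident. Because such a token is signed for the whole storage account with broad permissions, any URL carrying it exposes every container in the account, not just the dataset it was meant to share.

```python
from datetime import datetime, timedelta, timezone
from azure.storage.blob import (
    AccountSasPermissions,
    ResourceTypes,
    generate_account_sas,
)

# Placeholder credentials -- never hard-code a real account key.
ACCOUNT_NAME = "exampleaccount"
ACCOUNT_KEY = "<account-key>"

# An account-level SAS: scoped to every service, container and object in the
# storage account, with broad permissions and a multi-year expiry. A URL that
# embeds a token like this exposes the whole account to whoever finds it.
account_sas = generate_account_sas(
    account_name=ACCOUNT_NAME,
    account_key=ACCOUNT_KEY,
    resource_types=ResourceTypes(service=True, container=True, object=True),
    permission=AccountSasPermissions(read=True, list=True, write=True, delete=True),
    expiry=datetime.now(timezone.utc) + timedelta(days=365 * 10),
)
```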
Upon following the link, the hackers gained access to disk backups of two former employees’ workstation profiles, along with internal Microsoft Teams messages. In total, the storage account held 38TB of sensitive information, including secrets, private keys, passwords and the open-source AI training data.
Notably, SAS token lifetimes are set by whoever creates them and can stretch years into the future, and Azure offers no easy way to track or revoke individual tokens once issued, making them ill-suited for sharing critical data externally. In a security advisory on September 7, Microsoft underscored that "Attackers may create a high-privileged SAS token with long expiry to preserve valid credentials for a long period."
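By contrast, a narrowly scoped alternative looks like the following sketch (again with hypothetical account, container and blob names): a read-only SAS tied to a single blob with a short expiry, so a leaked link both grants less and ages out quickly.

```python
from datetime import datetime, timedelta, timezone
from azure.storage.blob import BlobSasPermissions, generate_blob_sas

# A blob-level SAS: read-only, tied to one object, valid for 24 hours.
scoped_sas = generate_blob_sas(
    account_name="exampleaccount",
    container_name="public-datasets",
    blob_name="robust-models.tar",
    account_key="<account-key>",
    permission=BlobSasPermissions(read=True),                  # read-only
    expiry=datetime.now(timezone.utc) + timedelta(hours=24),   # short-lived
)

# The resulting link grants access to a single file and expires within a day.
download_url = (
    "https://exampleaccount.blob.core.windows.net/public-datasets/"
    f"robust-models.tar?{scoped_sas}"
)
print(download_url)
```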
It's worth emphasizing that no customer data was compromised, and there was no threat to other Microsoft services stemming from the AI dataset exposure. This incident isn't unique to Microsoft's AI endeavors; any large-scale open-source dataset could potentially face similar risks. Wiz, in its blog post, highlighted that "Researchers collect and share massive amounts of external and internal data to construct the required training information for their AI models. This poses inherent security risks tied to high-scale data sharing."
To prevent similar incidents, organizations are advised to caution employees against oversharing data. In this instance, the Microsoft researchers could have safeguarded the public AI dataset by relocating it to a dedicated storage account. Vigilance against supply chain attacks is also crucial: these can occur when improper permissions leave publicly shared files open to tampering, allowing attackers to inject malicious code into them.
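For teams acting on that advice, a simple audit is one reasonable starting point. The sketch below assumes the azure-identity and azure-storage-blob SDKs, a hypothetical account URL, and a credential with list permissions; it enumerates the containers in a storage account and flags any that allow anonymous public access, the kind of over-permissive setting that can expose data or leave files open to tampering.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# Connect with an Azure AD identity rather than a shared key.
service = BlobServiceClient(
    account_url="https://exampleaccount.blob.core.windows.net",
    credential=DefaultAzureCredential(),
)

# public_access is None when anonymous access is off, or "blob"/"container"
# when anyone on the internet can read the container's contents.
for container in service.list_containers():
    if container.public_access:
        print(f"Publicly readable container: {container.name} "
              f"(access level: {container.public_access})")
```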
"As we see wider adoption of AI models within companies, it’s important to raise awareness of relevant security risks at every step of the AI development process, and make sure the security team works closely with the data science and research teams to ensure proper guardrails are defined,” the Wiz team wrote in their blog post.
Ami Luttwak, CTO and cofounder of Wiz, released the following statement to TechRepublic: “As AI adoption increases, so does data sharing. AI is built on collecting and sharing lots of large models and quantities of data, and so what happens is you get high volumes of information flowing between teams. This incident reveals the importance of sharing data in a secure manner. Wiz also recommends security teams gain more visibility into the process of AI research, and work closely with their development counterparts to address risks early and set guardrails.”