Faulty Upgrade at Cloudflare Results in User Data Loss

 

Cloudflare has disclosed a serious incident affecting its logging-as-a-service platform, Cloudflare Logs, in which a faulty software update caused customer data loss. The US-based connectivity cloud firm acknowledged that around 55% of the log data generated over a 3.5-hour period on November 14, 2024, was permanently lost. The loss was caused by a chain of misconfigurations and system failures.

Cloudflare Logs collects event metadata from Cloudflare's global network and makes it available to customers for troubleshooting, compliance, and analytics. To speed up log delivery and avoid overwhelming customers, the organisation uses Logpush, a system that collects and transmits data in manageable batches. An update to Logpush set off a series of system failures that disrupted services and led to the data loss.

The incident started with a configuration update intended to enable support for an additional dataset in Logpush. A defect in the configuration generation system resulted in Logfwdr, the component responsible for forwarding logs, receiving an empty configuration. That empty configuration told Logfwdr that no logs needed to be delivered. Cloudflare discovered the bug within minutes and reverted the update.

However, rolling back the update triggered a separate, pre-existing issue in Logfwdr. This flaw, which was linked to a fail-safe mechanism designed to "fail open" in the event of configuration errors, caused Logfwdr to process and attempt to transmit logs for all customers, not just those with active configurations.
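
The fail-open behaviour described above can be illustrated with a minimal Python sketch. The function name `customers_to_forward`, its parameters, and the rule that an empty configuration counts as a configuration error are all assumptions made for illustration; none of them is taken from Cloudflare's code.

```python
from typing import Dict, Optional, Set


def customers_to_forward(
    config: Optional[Dict[str, bool]],
    all_customers: Set[str],
    fail_open: bool = True,
) -> Set[str]:
    """Return the customers whose logs should be forwarded (hypothetical helper)."""
    if not config:
        # Assumption: a missing or empty configuration is treated as an error.
        # Failing open forwards logs for everyone; failing closed forwards none.
        return set(all_customers) if fail_open else set()
    # Normal path: forward only for known customers with an enabled entry.
    return {c for c, enabled in config.items() if enabled and c in all_customers}


everyone = {"acme", "globex", "initech"}  # made-up customer names
print(sorted(customers_to_forward({"acme": True, "globex": False}, everyone)))  # ['acme']
print(len(customers_to_forward({}, everyone)))                    # 3 (fail open)
print(len(customers_to_forward({}, everyone, fail_open=False)))   # 0 (fail closed)
```

In this sketch the fallback path produces work proportional to the total number of customers, which is harmless for a small customer set and very costly for a large one.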

The unexpected surge in log processing overloaded Buftee, Cloudflare's log buffering system. Buftee is designed to keep a distinct buffer for each customer to ensure data integrity and prevent interference between log operations. Under typical circumstances, Buftee manages millions of buffers worldwide. The influx of data caused by the Logfwdr fault increased buffer demand fortyfold, exceeding Buftee's capacity and rendering the system unresponsive.
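
To make the scaling concrete, here is a small sketch of a store that allocates one buffer per customer. The class `PerCustomerBuffers` and the customer counts are hypothetical; the counts were chosen only to reproduce the fortyfold ratio reported above and say nothing about Cloudflare's real numbers.

```python
from collections import defaultdict, deque


class PerCustomerBuffers:
    """Hypothetical stand-in for a store that keeps one buffer per customer."""

    def __init__(self) -> None:
        # A buffer is created lazily the first time a customer's log arrives.
        self._buffers = defaultdict(deque)

    def append(self, customer_id: int, record: bytes) -> None:
        self._buffers[customer_id].append(record)

    def buffer_count(self) -> int:
        return len(self._buffers)


def buffers_needed(customer_ids) -> int:
    """Feed one record per customer and report how many buffers were created."""
    store = PerCustomerBuffers()
    for cid in customer_ids:
        store.append(cid, b"log line")
    return store.buffer_count()


normal = buffers_needed(range(1_000))      # illustrative: customers with log delivery set up
fail_open = buffers_needed(range(40_000))  # illustrative: every customer
print(fail_open // normal)  # 40
```

Because a buffer is created on first use, the number of buffers tracks the number of distinct customers seen, not the volume of logs per customer.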

According to Cloudflare, resolving the issue required a complete system reset and several hours of recovery time. During this time, the company was unable to transfer or recover the affected logs, which resulted in permanent data loss.

Cloudflare attributed the incident to gaps in its system safeguards and configuration processes. While mechanisms for dealing with such issues existed, they were not set up to handle a failure on this scale. Buftee, for example, offers capabilities designed to handle unexpected surges in buffer demand, but these had not been enabled, leaving the system vulnerable to overload.
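
One common form such a safeguard can take is a hard cap on the number of buffers, so that a surge is rejected at the edge instead of exhausting the whole system. The sketch below is an assumption about what a guard of this kind might look like; `CappedBufferPool` and `max_buffers` are hypothetical names and do not describe Buftee's actual features.

```python
from typing import Dict, List


class CappedBufferPool:
    """Hypothetical per-customer buffer pool with an upper bound on buffers."""

    def __init__(self, max_buffers: int) -> None:
        self.max_buffers = max_buffers
        self.rejected = 0  # records refused because the cap was reached
        self._buffers: Dict[str, List[bytes]] = {}

    def append(self, customer_id: str, record: bytes) -> bool:
        """Store a record; return False if a new buffer would exceed the cap."""
        buf = self._buffers.get(customer_id)
        if buf is None:
            if len(self._buffers) >= self.max_buffers:
                # Shed load for new customers; existing buffers keep working.
                self.rejected += 1
                return False
            buf = self._buffers[customer_id] = []
        buf.append(record)
        return True

    def buffer_count(self) -> int:
        return len(self._buffers)


pool = CappedBufferPool(max_buffers=3)
for cid in ["a", "b", "c", "d", "e"]:
    pool.append(cid, b"log line")
print(pool.buffer_count(), pool.rejected)  # 3 2
```

The trade-off in this design is explicit: logs for customers beyond the cap are dropped, but customers who already have a buffer continue to be served.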

The company also stated that the fail-open mechanism in Logfwdr, which was introduced during the service's early development, had not been updated to match the much larger customer base and traffic levels. This oversight allowed the system to send logs for all customers, resulting in a resource spike that exceeded operational limits.

Cloudflare has apologised for the disruption and pledged to prevent similar incidents in the future. The company is implementing new alerts to better detect configuration issues, improving its failover procedures to manage larger-scale failures, and running simulations to verify system resilience under overload scenarios.
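
As an illustration of what an alert on configuration issues could check, the sketch below compares a newly generated configuration against the previous one and flags an empty or sharply shrunken result before it is rolled out. The function `config_alerts`, the `max_drop_ratio` threshold, and the message wording are assumptions made for this example; the source does not describe how Cloudflare's alerts work.

```python
from typing import Dict, List


def config_alerts(
    previous: Dict[str, dict],
    candidate: Dict[str, dict],
    max_drop_ratio: float = 0.5,
) -> List[str]:
    """Return human-readable alerts for a suspicious candidate configuration."""
    alerts: List[str] = []
    if not previous:
        return alerts  # nothing to compare against
    if not candidate:
        alerts.append("candidate configuration is empty")
        return alerts
    # Entries present before but missing from the candidate.
    dropped = previous.keys() - candidate.keys()
    if len(dropped) / len(previous) > max_drop_ratio:
        alerts.append(f"{len(dropped)} of {len(previous)} entries disappeared")
    return alerts


previous = {"job-1": {}, "job-2": {}, "job-3": {}, "job-4": {}}  # made-up entries
print(config_alerts(previous, {}))             # ['candidate configuration is empty']
print(config_alerts(previous, {"job-1": {}}))  # ['3 of 4 entries disappeared']
```

A check like this is cheap to run at generation time and catches the specific failure of an empty configuration before any downstream component sees it.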

Furthermore, Cloudflare is improving its logging design so that individual system components can better withstand cascading failures. While faults in complex systems are unavoidable, the company's priority is to minimise their impact and ensure that services recover quickly.

Last month, Cloudflare said it had successfully mitigated the largest recorded distributed denial-of-service (DDoS) attack, which peaked at 3.8 terabits per second (Tbps). The attack was part of a larger campaign aimed at industries such as internet services, finance, and telecommunications. The campaign consisted of over 100 hyper-volumetric DDoS attacks carried out over the course of a month, overwhelming network infrastructure with massive amounts of data.