
Phishing URL Blocking Failure Leads to Cloudflare Service Disruptions

 


Yesterday, an attempt by Cloudflare to block a phishing URL hosted on its R2 object storage platform inadvertently caused an outage that disrupted multiple services for nearly an hour. R2 is Cloudflare's scalable, cost-efficient object storage service, comparable to Amazon S3, and it integrates seamlessly with the rest of Cloudflare's ecosystem.

As an S3-compatible storage service, R2 lets users store data across multiple locations, replicates it for availability and reliability, and charges no egress fees, so data can be retrieved at no cost. The incident began when a Cloudflare employee responded to an abuse report about a phishing URL hosted on the R2 platform.

While attempting to restrict access to that URL, the employee inadvertently disabled the entire R2 Gateway service rather than the specific endpoint, turning a routine abuse response into a widespread outage of several Cloudflare services that lasted almost an hour.
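
As a rough illustration of that S3 compatibility, the sketch below uses boto3 to read from an R2 bucket through the per-account endpoint format Cloudflare documents; the account ID, credentials, bucket name, and object key are placeholders rather than details from this incident.

```python
# Minimal sketch: accessing Cloudflare R2 through its S3-compatible API with
# boto3. All identifiers and credentials below are placeholders.
import boto3

ACCOUNT_ID = "<your-account-id>"          # placeholder
ACCESS_KEY_ID = "<r2-access-key-id>"      # placeholder
SECRET_ACCESS_KEY = "<r2-secret-key>"     # placeholder

# R2 exposes a per-account S3-compatible endpoint, so a standard S3 client
# works once it is pointed at that endpoint.
s3 = boto3.client(
    "s3",
    endpoint_url=f"https://{ACCOUNT_ID}.r2.cloudflarestorage.com",
    aws_access_key_id=ACCESS_KEY_ID,
    aws_secret_access_key=SECRET_ACCESS_KEY,
    region_name="auto",
)

# List buckets and fetch a single object, exactly as one would against S3.
print(s3.list_buckets()["Buckets"])
obj = s3.get_object(Bucket="example-bucket", Key="example.txt")
print(obj["Body"].read())
```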

The mitigation attempt thus caused an unintended loss of service availability. During the primary incident window, all R2 users saw 100% failure rates when accessing their buckets and objects, and services that rely on R2 experienced elevated error rates and operational failures to a degree that depended on how each uses the platform, as detailed below.

Cloudflare R2 Object Storage and several related services were affected during the incident, which ran from 08:14 to 09:13 UTC and lasted 59 minutes. Among the impacted services, Stream saw a complete failure of video uploads and streaming delivery, while Images saw a 100% failure rate for image uploads and downloads. Cache Reserve was completely down for the duration, driving a sharp increase in requests to customer origins.

Vectorize saw a 75% failure rate for queries and a 100% failure rate for insert, upsert, and delete operations. Log Delivery suffered delays and data loss, affecting up to 13.6% of logs for R2-related jobs and up to 4.5% for non-R2 delivery jobs. The Key Transparency Auditor's signature publishing and reading operations were completely inoperable. Several other services were only indirectly affected and saw partial disruptions rather than outright failure.

Durable Objects saw a 0.09% increase in error rates after service restoration, driven by the wave of client reconnections; Cache Purge recorded a 1.8% increase in HTTP 5xx errors along with a tenfold increase in latency; and Workers & Pages had a deployment failure rate of 0.002%, affecting only projects that use R2. During the outage itself, between 08:14 and 09:13 UTC, 100% of operations against the R2 platform failed.

Services built on R2 saw correspondingly higher failure rates for any operations that depend on it. Between 09:13 and 09:36 UTC, once R2 had recovered and client connections were being re-established, the backlog of requests temporarily increased the load on R2's metadata layer, which is built on Durable Objects. The effect was modest: error rates observed in North America rose by only 0.09% during this recovery period.
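
The reconnection backlog described above is a familiar client-side concern: when a dependency recovers, synchronized retries from every consumer can themselves add load. The sketch below shows a generic jittered-backoff retry wrapper of the kind a service consuming R2 might use; it illustrates the general pattern only, and the `get_object` callable and its parameters are hypothetical rather than anything from Cloudflare's internal systems.

```python
# Illustrative only: bounded retries with jittered exponential backoff for a
# caller that depends on an object store. A generic client-side pattern, not
# Cloudflare's internal handling of this incident.
import random
import time

def fetch_with_backoff(get_object, bucket, key, attempts=5, base_delay=0.2):
    """Call get_object(bucket, key), retrying transient failures.

    Jittered exponential backoff spreads retries out so a recovering backend
    is not hit by a synchronized wave of reconnections.
    """
    for attempt in range(attempts):
        try:
            return get_object(bucket, key)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the caller
            # Sleep for a random fraction of an exponentially growing window.
            time.sleep(random.uniform(0, base_delay * (2 ** attempt)))
```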

According to Cloudflare, the incident was primarily caused by human error and the absence of critical safeguards, such as validation checks for high-impact actions. The company has taken immediate corrective measures, including removing the ability to disable systems from the abuse review interface and restricting the Admin API so that internal accounts can no longer shut down services through it.

Looking ahead, Cloudflare plans to improve its provisioning processes, enforce stricter access controls, and introduce two-party approval for high-risk actions. Together, these measures are intended to protect the integrity of the platform and prevent similar unintended service interruptions.
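
To make the two-party approval idea concrete, the following sketch shows a generic guard that refuses to run a high-impact action unless a second, distinct approver has signed off. The action names and structure are purely illustrative assumptions, not Cloudflare's actual tooling.

```python
# Illustrative sketch of a two-party approval gate for high-impact actions,
# such as disabling a production service. Not Cloudflare's actual system.
HIGH_IMPACT_ACTIONS = {"disable_service", "delete_bucket"}

def execute_action(action, target, requested_by, approvals):
    """Run an action only if high-impact requests carry an independent approval."""
    if action in HIGH_IMPACT_ACTIONS:
        # Require at least one approver who is not the requester.
        independent = {person for person in approvals if person != requested_by}
        if not independent:
            raise PermissionError(
                f"{action} on {target} requires approval from a second person"
            )
    print(f"executing {action} on {target} (requested by {requested_by})")

# Blocking a single URL proceeds as a routine action, while disabling a whole
# service is rejected unless a distinct second approver is present.
execute_action("block_url", "https://example.invalid/phish", "analyst_a", approvals=[])
execute_action("disable_service", "r2-gateway", "analyst_a", approvals=["analyst_b"])
```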