Last week’s major internet outage in the US, affecting services like Google and AWS, was caused by a simple programming error—a null pointer dereference—in Google’s service control system triggered by an untested policy change. Despite a swift response from Google’s engineering team, the incident highlighted the fragility of modern internet infrastructure and the risks of centralized cloud dependencies.
Last week, a significant internet outage occurred in the US, lasting about three hours and affecting major services like Google, AWS, Spotify, and Cloudflare. This widespread disruption was initially suspected to be a cyberattack, reminiscent of past incidents like the 2016 Dyn attack. However, it was later revealed that the root cause was a simple programming error—a null pointer dereference—in Google’s service control system, which manages API authorization and quota checks across their cloud infrastructure.
The issue originated from a new feature added to Google’s service control binary on May 29, 2025, which was rolled out region by region. This feature included additional quota policy checks but was never fully tested in production because the specific code path that contained the bug was not triggered until a policy change was made on June 12. This policy change included blank fields that caused the service control binary to crash due to the null pointer dereference, leading to a cascading failure across Google’s cloud services and, consequently, much of the internet.
A null pointer dereference occurs when a program tries to access memory at address zero, which is invalid and causes the program to crash. In this case, the service control binary failed to handle a null pointer returned when looking up a non-existent field in a JSON object. This simple oversight caused the entire system to fail. Although Google had implemented a “red button” kill switch to disable problematic code paths quickly, the lack of feature flagging meant the buggy code was fully active globally, amplifying the impact.
Google’s site reliability engineering team responded swiftly, identifying the root cause within 10 minutes and deploying the kill switch within 40 minutes. However, recovery was slowed by a “herd effect,” where simultaneous restarts of service control instances overwhelmed the underlying infrastructure, particularly in larger regions like US Central 1. It took nearly three hours for full recovery, highlighting the challenges of managing large-scale cloud infrastructure during outages.
The incident underscores the internet’s growing dependence on a few major cloud providers, creating single points of failure that contradict the internet’s original decentralized design. While programming languages like Rust offer features to reduce null pointer errors, they are not immune to crashes caused by other types of bugs. Ultimately, this outage serves as a cautionary tale about the fragility of modern internet infrastructure and the importance of robust error handling and gradual feature rollouts.