June 3, 2019

By Jonas Krogell, Technical Product Manager, Netrounds

Were you affected by the recent Google Cloud outage? Yeah, we were, too. Yesterday, on June 2nd, Google Cloud experienced a major issue with its services, tracked by Google as Incident #19009. As far as we know, Google has not yet provided a detailed root cause analysis.

A number of web services suffered major outages, predominantly in the US, with some issues reported from Europe as well. It took Google more than four hours to resolve the issues, which is far too long for customers who depend on its services. Here is an analysis of the outage, and of what could have been done proactively to mitigate this issue in your network.

The Analysis

Like many other modern-day companies, we use cloud services for development, research and production. As we specialize in testing and monitoring, we were able to capture data on how yesterday's issue affected Google's cloud customers.

For our own research we operate a number of Netrounds Test Agents running in the major public clouds such as AWS, GCP, and Azure in selected locations. These measure network availability and performance up to 1000 times per second, giving us millisecond resolution on real network and service performance and availability events.

For the purposes of this blog post we are initially going to focus on analyzing the results from three Test Agents located in the Google Cloud regions us-central1, us-west1 and us-east1 (as reflected in the graph below). All dates and times are in the US/Pacific time zone (PDT at this time of year). As a baseline we continuously run a full-mesh measurement between the three regions; below is how it normally looks in our GUI (data in the image was captured on the 1st of June 2019):

The green bars indicate that packet loss, delay, and delay variation/jitter are all normal for this 24-hour period, indicating that all measurements are below KPI thresholds and that no issues or abnormalities are observed. The one-way latency values are stable at ~18 ms us-east to us-central and ~33 ms us-east to us-west.
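To make the idea of a KPI threshold concrete, here is a minimal sketch of how one measurement interval could be checked against limits for loss, delay and jitter. The threshold values and all names are hypothetical and chosen only for illustration; they are not the values or the API used by the Test Agents.

```python
from dataclasses import dataclass, fields


@dataclass
class IntervalStats:
    loss_pct: float    # packet loss in percent
    delay_ms: float    # average one-way delay in milliseconds
    jitter_ms: float   # delay variation in milliseconds


# Hypothetical limits, picked only for this example.
THRESHOLDS = IntervalStats(loss_pct=0.1, delay_ms=50.0, jitter_ms=5.0)


def violated_kpis(stats, limits=THRESHOLDS):
    """Return the names of the KPIs in `stats` that exceed their limit."""
    return [
        f.name
        for f in fields(stats)
        if getattr(stats, f.name) > getattr(limits, f.name)
    ]


if __name__ == "__main__":
    normal = IntervalStats(loss_pct=0.0, delay_ms=18.0, jitter_ms=0.3)
    degraded = IntervalStats(loss_pct=2.5, delay_ms=60.0, jitter_ms=0.3)
    print(violated_kpis(normal))    # empty list: a "green" interval
    print(violated_kpis(degraded))  # lists the KPIs that were exceeded
```

An interval with an empty result would be drawn as green in a view like the one above; any non-empty result would mark it as degraded.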

However, looking at the same view for the 2nd of June 2019 when the outage occurred, we instead see this:

The reddish/orange markings indicate issues. The Netrounds Test Agents have detected a KPI degradation and an SLA violation for traffic going from us-east1 to us-west1 and us-central1. The Netrounds Test Agents observed the first dropped packets at 11:49 and the last dropped packets at 14:43.

By zooming in to traffic between 11:00 and 16:00 during the outage, we are able to see more clearly where and when the issues are occurring.

Immediately we see that it was mainly traffic from us-east to us-west and us-central that was affected. Traffic going in the opposite direction had no issues with dropped packets. So the issue was asymmetric - which is typical for issues related to congestion. We also see that the problems were slightly more severe in the us-east to us-west direction than in the us-east to us-central direction.
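The kind of asymmetry described here can also be flagged automatically. The sketch below compares packet loss in the two directions of each region pair; the function name, the ratio and floor parameters, and the loss figures in the example are all made up for illustration and are not our measured values.

```python
def asymmetric_pairs(loss_pct, ratio=10.0, floor=0.01):
    """Find directions whose loss dwarfs the loss in the reverse direction.

    `loss_pct` maps (source, destination) to packet loss in percent.
    """
    flagged = []
    for (src, dst), forward in loss_pct.items():
        reverse = loss_pct.get((dst, src))
        if reverse is None:
            continue  # no measurement in the opposite direction
        # `floor` avoids dividing by zero when the reverse direction is clean.
        if forward > floor and forward / max(reverse, floor) >= ratio:
            flagged.append((src, dst))
    return flagged


if __name__ == "__main__":
    # Illustrative numbers only.
    example = {
        ("us-east1", "us-west1"): 4.0,
        ("us-west1", "us-east1"): 0.0,
        ("us-east1", "us-central1"): 2.0,
        ("us-central1", "us-east1"): 0.0,
        ("us-west1", "us-central1"): 0.0,
        ("us-central1", "us-west1"): 0.0,
    }
    print(asymmetric_pairs(example))
```

With these example numbers, only the two directions leaving us-east1 are flagged.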

Drilling down into details on the us-east->us-west stream we see this graph:

Here we see that the average latency started to increase at 11:48, and at the same time packet loss was observed. At 12:07 the latency was back to normal, but packet loss was still present; this may have been due to mitigation actions by Google to restore the service.

Doing the same drill-down into the us-east->us-central stream gives us this graph:

Here we see no slow build-up of latency but only intermittent packet loss - though not as frequent as in the us-east->us-west direction. In this direction we see a jump in latency from 18 ms to 21 ms at 12:42; this probably stems from re-routing in the Google network to mitigate the congestion issue. This slightly longer route was in use until 15:43, after which the latency went back down to the usual 18 ms.
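A sustained latency shift like the 18 ms to 21 ms jump is easy to pick out of a time series. The following is a minimal sketch, assuming evenly spaced latency samples; the function name, window size and minimum shift are our own illustrative choices, and the sample data is synthetic.

```python
def find_latency_steps(samples_ms, window=5, min_shift_ms=2.0):
    """Return (index, old_level, new_level) for each sustained latency shift.

    A step is reported at index i when every one of the `window` samples
    starting at i differs from every one of the `window` samples before i
    by at least `min_shift_ms`, in the same direction.
    """
    steps = []
    for i in range(window, len(samples_ms) - window + 1):
        before = samples_ms[i - window:i]
        after = samples_ms[i:i + window]
        went_up = min(after) - max(before) >= min_shift_ms
        went_down = min(before) - max(after) >= min_shift_ms
        if went_up or went_down:
            steps.append((i, sum(before) / window, sum(after) / window))
    return steps


if __name__ == "__main__":
    # Synthetic series: 18 ms baseline, a stretch at 21 ms, then back to 18 ms.
    series = [18.0] * 10 + [21.0] * 10 + [18.0] * 10
    for index, old, new in find_latency_steps(series):
        print(f"sample {index}: {old:.0f} ms -> {new:.0f} ms")
```

Requiring whole windows on both sides of the step ignores single spikes, at the cost of missing shifts that last fewer than `window` samples.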

Did Coast-to-Coast Traffic Leave the US During the Incident?

So far this looks like a normal congestion issue that affected traffic only in one direction. However, we did make a very interesting observation in the stream going from us-west to us-east; the latency/packet loss graph for this direction is seen here:

What is remarkable here is the pair of latency shifts at 13:22 and 13:53. At 13:22 the latency jumped from its usual 34 ms to a consistent 191 ms, without any packet loss occurring, and it stayed at the higher level for 31 minutes before dropping back at 13:53.

The figure of 191 ms corresponds to a signal path of approximately 38,000 km (the speed of light in an optical fiber being about ⅔ of that in vacuum, or roughly 200 km per millisecond). As the earth’s circumference is about 40,000 km, this suggests that traffic from the US West Coast to the East Coast may have traversed the globe in the opposite direction during these 31 minutes.
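The arithmetic behind that estimate can be reproduced in a few lines. This is a rough sketch that assumes the whole one-way latency is propagation delay in fiber, with no queuing or processing delay; the function and constant names are ours.

```python
# Rough one-way path length implied by a latency figure.
# Assumption: all of the delay is propagation delay in optical fiber.
C_VACUUM_KM_PER_S = 299_792.458   # speed of light in vacuum
FIBER_FACTOR = 2 / 3              # approximate slowdown in optical fiber


def path_length_km(one_way_latency_ms):
    """Return the fiber distance covered in the given one-way latency."""
    speed_km_per_ms = C_VACUUM_KM_PER_S * FIBER_FACTOR / 1000
    return one_way_latency_ms * speed_km_per_ms


if __name__ == "__main__":
    for label, ms in (("normal", 34), ("during the jump", 191)):
        print(f"{label}: {ms} ms -> about {path_length_km(ms):,.0f} km")
```

Under the same assumption, the normal 34 ms works out to roughly 6,800 km, well above the straight-line distance between the two coasts. That is expected, since fiber routes are indirect and part of the delay is not pure propagation, so both figures should be read as upper bounds on the path length.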

My understanding is that traffic between GCP regions is normally encrypted and protected in several layers (https://cloud.google.com/security/encryption-in-transit/), meaning that security was likely not compromised; but it's still a remarkable and interesting event.

Unfortunately, GCP does not provide the option to do traceroutes internally in a VPC, or else the Netrounds Path Trace feature (https://www.netrounds.com/download/network-path-tracing-with-netrounds/) would have been able to show the exact route taken during this event.

Simultaneous Loss of us-east4 and us-west2

Additionally, we operate Netrounds Test Agent instances in us-east4 and us-west2. These were completely unreachable during the incident and reported 100% packet loss on all streams between 11:47 and 15:23.

us-east1 <-> us-east4 loss graph:

us-west1 <-> us-west2 loss graph:

It's surprising that we observed complete loss of traffic between us-west1 and us-west2 since the Google incident ticket describes the issue as being related to congestion on the east coast. The loss of traffic was not preceded by build-up of latency or loss - rather it came suddenly and without any previous warnings of congestion.

Responding Before Your Customers Complain

Being able to immediately discover, analyze and mitigate outages is critical for running a professional 24/7 service. By conducting ongoing measurements of the network, the Netrounds Test Agents discovered this cloud outage, and alerted on it, just a few seconds after it began. This gives an unprecedentedly clear view of where an issue originates and how services are impacted, and it quickly ends any blame games so that teams can focus on troubleshooting and mitigating the incident to get services back online.

Interested in more? Read our white paper SLA Exposé: Are Users Right in Complaining About Their Networks? 

Netrounds will also be at Cisco Live US next week in San Diego. Come visit us in the Cisco booth and don't forget to ask about Orchestrated Assurance or book a meeting.