Tag Archives : outage September 2010

Facebook Outage

By: Craig Labovitz

Another Facebook outage, an outpouring of tweets, press articles and an obligatory ATLAS post below.

We use ATLAS data to graph Facebook (AS32934) traffic with 80 ISPs around the world between 5pm September 22 and 5pm EDT today. You can see Facebook traffic plummet around 1:30pm and return shortly after 4pm. From a quick glance at the data, the outage appears to be global (impacting all of the 80 ISPs).
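For readers who want to try this kind of check on their own flow data, here is a minimal sketch. It is not how ATLAS works internally. The per-ISP dictionary layout, the function names and the 50% threshold are all assumptions made for illustration. It sums per-ISP series, finds the windows where the total falls well below its median, and lists which ISPs saw the drop.

```python
from statistics import median


def aggregate(per_isp):
    """Sum equal-length per-ISP series into one total series."""
    return [sum(vals) for vals in zip(*per_isp.values())]


def outage_windows(series, drop_fraction=0.5):
    """Return (start, end) index pairs where traffic is below
    drop_fraction * median(series). The threshold is an assumption."""
    threshold = median(series) * drop_fraction
    windows, start = [], None
    for i, value in enumerate(series):
        if value < threshold and start is None:
            start = i  # a low-traffic window opens here
        elif value >= threshold and start is not None:
            windows.append((start, i - 1))  # window closed on the previous sample
            start = None
    if start is not None:
        windows.append((start, len(series) - 1))
    return windows


def affected_isps(per_isp, window, drop_fraction=0.5):
    """List ISPs whose traffic stayed below their own threshold for the whole window."""
    start, end = window
    hit = []
    for name, series in per_isp.items():
        threshold = median(series) * drop_fraction
        if all(v < threshold for v in series[start:end + 1]):
            hit.append(name)
    return sorted(hit)


# Hypothetical samples (arbitrary units), one value per time bucket.
per_isp = {
    "isp_a": [10, 10, 10, 1, 1, 10, 10, 10],
    "isp_b": [20, 20, 20, 2, 2, 20, 20, 20],
}
total = aggregate(per_isp)
windows = outage_windows(total)
print(windows, affected_isps(per_isp, windows[0]))
```

A median baseline only works while the outage covers less than half of the samples. For a longer outage, a baseline taken from a previous day would be a safer choice.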


We have no information on the root cause (no sign of obvious BGP instability or DDoS).

Lots of speculation on Twitter.

UPDATE 8:30pm Sept 23: Facebook explains this was an internal configuration management problem.

– Craig

 

A Brief Look at Facebook Outage

By: Craig Labovitz

Since we’ve written about Google’s multiple past outages (e.g., the GoogleLapse of May 2009 and the more recent Google Blip), it seems only fair to quickly cover Facebook’s problems last Friday.

The graph below shows coarse-grained Facebook (ASN 32934) traffic statistics from 60 randomly selected ISPs around the world. While most press / blog coverage (e.g. Gigaom’s “Facebook Sees Major Outage”) pegged the disruption at 5:30 pm ET, the traffic data suggest Facebook’s problems began much earlier in the day.

[Graph: Facebook (ASN 32934) traffic across 60 ISPs]

Normally, Facebook’s diurnal traffic follows the same pattern as other social media and interactive consumer sites. Facebook traffic generally reaches a low overnight at 2am, grows to its daily peak at 5pm EDT, declines briefly, and then reaches a second, smaller peak at 9pm EDT (the peaks likely matching the North American end of work day and prime time across PDT and EDT).

But beginning Friday morning at 2am, Facebook saw dozens of modest traffic drops (each of a few Gbps) before plummeting by 30 Gbps at 5pm EDT for roughly twenty minutes.
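Because the traffic is strongly diurnal, a drop only stands out when compared against what is normal for that time of day. The sketch below illustrates the idea and makes no claim to be the method behind the graph. It compares each observed sample with a same-time-of-day baseline (for example, the previous day) and reports the shortfalls. The function name, the sample values and the 2 Gbps cutoff are hypothetical.

```python
def drops_vs_baseline(observed, baseline, min_drop_gbps):
    """Return (index, deficit) for every sample that sits at least
    min_drop_gbps below the baseline value for the same time slot."""
    if len(observed) != len(baseline):
        raise ValueError("observed and baseline must cover the same time slots")
    return [
        (i, expected - seen)
        for i, (seen, expected) in enumerate(zip(observed, baseline))
        if expected - seen >= min_drop_gbps
    ]


# Hypothetical values in Gbps, one per time slot.
baseline = [40, 60, 100, 80]   # e.g. the same slots on the previous day
observed = [40, 57, 70, 80]    # the day under investigation
drops = drops_vs_baseline(observed, baseline, min_drop_gbps=2)
print(drops)
```

With these made-up numbers, the second slot shows a modest 3 Gbps shortfall and the third a 30 Gbps one. That is the shape of the pattern described above: small dips followed by a large drop.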

What happened to Facebook?

While there is no shortage of speculation on Twitter and operations mailing lists, Facebook so far is not saying. I think a recent post to an engineering outage discussion list sums up the situation:

“Given Facebook’s complexity, who knows what the problem was. Load balancer or layer 7 filter/re-writer (think F5) issues? Back-end server problems? Software misconfiguration? … Some developer deciding to just roll something out in the middle of the day (as is quite common with social networking sites these days)? We’ll probably never know.”

Facebook has come a long way from a few hundred Harvard freshmen looking for dates. As Facebook accelerates past 400 million users and pursues goals of nothing short of taking over the web, the social media giant has become critical infrastructure, at least from the perspective of millions of consumers and ISP support desks.

In an upcoming series of blogs, we’ll explore the growing Internet infrastructure footprint of Facebook, Google and other dominant Internet content companies.

 

Google Blip

By: Craig Labovitz

While Google’s YouTube outage today generated a steady stream of tweets and blog posts, a quick look at traffic across 50 or so small / mid-size ISPs around the world suggests this was more of a “blip” than a global outage.



Certainly the outage was nowhere near as large or as prolonged as the great “GoogleLapse” last year.

Below is a graph of traffic originating in Google (AS 15169) over the last 24 hours, using data from 50 ISPs around the world selected at random. All times are EDT. It looks like a small outage overnight preceded the larger traffic drop-off at 8am EDT.

[Graph: Google (AS 15169) traffic over the last 24 hours across 50 ISPs]

And a quick aside, my intent is not to pick on Google (unless, of course, they do not pick Ann Arbor) — all providers have outages. I just find Google an especially interesting case study given their size and overall impact on the Internet.

The Great GoogleLapse

By: Craig Labovitz

Web sites go down. Circuits fail. Network engineers goof router configs. And few of these outages ever make the nightly news…

But if you happen to be Google and your content constitutes up to 5% of all Internet traffic, people notice. Network engineers around the world frantically email traceroutes to mailing lists. IRC channels fill with speculation (“definitely was a DDoS attack”, “no, a worm”, “it was ISP xxx’s fault!”). And end users tweet (a lot).

So what does it look like when 5% of the Internet disappears on an otherwise uneventful Thursday morning? The graph below shows average traffic across 10 tier1/2 ISPs in North America from Google’s network (ASN 15169). The outage began at roughly 10:15am and lasted through 12:15pm EDT.

Looking at the data, most large transit providers appear to have been impacted (e.g., Level3, AT&T, etc.). Other providers (e.g. large consumer DSL / Cable) showed no drop in traffic from/to Google.

Looking at BGP (the snapshot below is from Arbor’s Routeviews servers), we see a lot of churn in Google’s BGP routes around the outage timeframe: one prefix I chose at random flaps across half a dozen providers before getting withdrawn.
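To put a rough number on "churn", one simple approach is to count how often a prefix switches between announced and withdrawn at each peer. The sketch below assumes the updates have already been parsed into `(timestamp, peer, prefix, kind)` tuples, with `kind` being `"A"` for an announcement or `"W"` for a withdrawal. That tuple layout, the peer names and the documentation prefixes are hypothetical, and the sketch ignores AS-path changes, which a fuller analysis would also track.

```python
from collections import defaultdict


def count_transitions(updates, prefix):
    """Count announce<->withdraw state changes for one prefix, per peer.

    updates: iterable of (timestamp, peer, prefix, kind) tuples,
             where kind is "A" (announce) or "W" (withdraw).
    """
    last_kind = {}               # most recent state seen from each peer
    flaps = defaultdict(int)     # peer -> number of state changes
    for _ts, peer, pfx, kind in sorted(updates):  # process in time order
        if pfx != prefix:
            continue
        if peer in last_kind and last_kind[peer] != kind:
            flaps[peer] += 1
        last_kind[peer] = kind
    return dict(flaps)


# Hypothetical, hand-made updates using documentation prefixes.
updates = [
    (1, "peer1", "192.0.2.0/24", "A"),
    (2, "peer1", "192.0.2.0/24", "W"),
    (3, "peer1", "192.0.2.0/24", "A"),
    (2, "peer2", "192.0.2.0/24", "A"),
    (4, "peer2", "192.0.2.0/24", "W"),
    (5, "peer1", "198.51.100.0/24", "W"),
]
flap_counts = count_transitions(updates, "192.0.2.0/24")
print(flap_counts)
```

A prefix that racks up transitions at many peers within a short window is a reasonable first candidate for a closer look in the raw updates.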

In a recent official company blog post, Google blamed some combination of airplanes and BGP for the outage.
