More AS4_PATH Triggered Global Routing Instability

By: Danny McPherson

For those of you not paying attention, a new round of instability is hitting the global routing system, again.  It is presumably being tickled by another ugly AS4_PATH tunnel bug, where someone [read: a broken implementation] erroneously includes AS_CONFED_* segments in an AS4_PATH attribute.  AS4_PATH is an optional transitive BGP attribute that’s essentially ‘tunneled’ through autonomous systems that do not speak 4-octet AS numbers.

[Figure: BGP Routing Instability - Update Frequencies (times in GMT -5)]

The problem is that when the attribute is decapsulated at the receiving end, by a BGP router that could be several networks away, those AS_CONFED_* segments aren’t supposed to be there.  The receiver may reset its local session with the adjacent external BGP speaker from which the update was received, or it may propagate the update to an internal BGP peer, which will likely drop the session with the transmitting speaker.  Neither reaction stops the problem at the source.  The prefix causing all the fuss appears to be 193.5.68.0/23; a copy of the suspect update [courtesy of ras] is available here.  Ras’s email, which provides some insight into what he’s seeing at the moment, is in the juniper-nsp archive (linked below), which is currently unavailable.
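
To make the failure mode concrete, here is a minimal Python sketch of how a receiver might walk an AS4_PATH attribute value and notice confederation segments.  It assumes the usual path segment encoding (one octet of segment type, one octet of AS count, then 4-octet AS numbers) and the segment type codes from RFC 4271 and RFC 5065.  The function names are hypothetical, and the sample path is the one reported in the comments below, ( 65490 65410 ) 8758 196621.  It is an illustration only, not how any particular router implements this.

```python
import struct

# Path segment type codes (RFC 4271 and RFC 5065).
AS_SET, AS_SEQUENCE, AS_CONFED_SEQUENCE, AS_CONFED_SET = 1, 2, 3, 4
CONFED_TYPES = {AS_CONFED_SEQUENCE, AS_CONFED_SET}


def encode_segment(seg_type, asns):
    """Encode one path segment: type, AS count, then 4-octet AS numbers."""
    return bytes([seg_type, len(asns)]) + struct.pack(f"!{len(asns)}I", *asns)


def parse_as4_path(value):
    """Split an AS4_PATH attribute value into (segment_type, [asns]) tuples."""
    segments = []
    offset = 0
    while offset < len(value):
        if offset + 2 > len(value):
            raise ValueError("truncated segment header")
        seg_type, count = value[offset], value[offset + 1]
        offset += 2
        end = offset + 4 * count
        if end > len(value):
            raise ValueError("truncated segment body")
        asns = list(struct.unpack(f"!{count}I", value[offset:end]))
        segments.append((seg_type, asns))
        offset = end
    return segments


def has_confed_segments(segments):
    """True if any segment is AS_CONFED_SEQUENCE or AS_CONFED_SET."""
    return any(seg_type in CONFED_TYPES for seg_type, _ in segments)


# Hypothetical re-creation of the suspect path: ( 65490 65410 ) 8758 196621
suspect = (encode_segment(AS_CONFED_SEQUENCE, [65490, 65410])
           + encode_segment(AS_SEQUENCE, [8758, 196621]))
segments = parse_as4_path(suspect)
leaked = has_confed_segments(segments)
```

The point of the sketch is that the leak is cheap to detect once the attribute is unpacked; the hard part is deciding what to do about it several networks away from the speaker that caused it.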

The error handling procedures in the relevant protocol specifications were rather vague in this area until recently.  A couple of drafts have been submitted to the IETF Inter-Domain Routing (IDR) WG that attempt to address the specific case outlined above, as well as the more general case of error handling for optional transitive attributes.  One of the drafts is an update to the original BGP Support for Four-octet AS Number Space specification, RFC 4893, with more explicit guidance and an expanded error handling section.  Another draft, Error Handling for Optional Transitive BGP Attributes, attempts to be more prescriptive in general, and addresses a few specific issues in existing specifications as well.
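
As an illustration of what more tolerant error handling could look like, the sketch below discards confederation segments from an already-parsed AS4_PATH and keeps processing the update, rather than resetting the session.  This is one possible policy, shown under my own assumptions; it is not a quotation of what either draft prescribes.  The function name and the (segment_type, [asns]) representation are hypothetical.

```python
# Path segment type codes (RFC 4271 and RFC 5065).
AS_SET, AS_SEQUENCE, AS_CONFED_SEQUENCE, AS_CONFED_SET = 1, 2, 3, 4
CONFED_TYPES = {AS_CONFED_SEQUENCE, AS_CONFED_SET}


def sanitize_as4_path(segments):
    """Split segments into (kept, dropped) instead of failing the session.

    `segments` is a list of (segment_type, [asns]) tuples. Confederation
    segments go to `dropped` so they can be logged; everything else is
    kept in its original order.
    """
    kept, dropped = [], []
    for seg_type, asns in segments:
        target = dropped if seg_type in CONFED_TYPES else kept
        target.append((seg_type, list(asns)))
    return kept, dropped


# The path from this incident: ( 65490 65410 ) 8758 196621
path = [(AS_CONFED_SEQUENCE, [65490, 65410]), (AS_SEQUENCE, [8758, 196621])]
clean, dropped = sanitize_as4_path(path)
```

Handling the bad segments locally like this keeps one broken implementation from tearing down sessions on routers that merely happened to receive the update.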

Some more information on the original problem from Rob Shakir and others on the IDR mailing list can be found here and in nested references.  I first heard about this specific incident through an email from Richard A Steenbergen (‘ras’) on the Juniper NSP mailing list, about one hour after the incident began.  Coincidentally, the web interface for the Juniper NSP mailing list seems to be having some reachability problems at the moment that may actually be related to this specific issue.  Some earlier text on the previous incident, as well as on a related problem, is available in an earlier post on our blog here.

It is worth noting that the amount of instability resulting from this seems to be significant, but not catastrophic at this point, although that’s likely very topologically dependent, and may very well not be the case for a few less fortunate folk.  Finally, this incident is still evolving; it has been going on for about 4 hours now.  Let’s hope it’s squelched soon.  If anything noteworthy emerges, I’ll be sure to post updates here.

Comments

  1. Richard Steenbergen 03/18/2009, 12:35 am

    Now that things have settled down I have a little more information on the problem. There seem to be three main issues here:

    1) Routers shouldn’t be leaking AS_CONFED into AS4_PATH. Juniper has an existing PSN-2009-01-200 publication on this issue, and all code built after 2009-01-26 has been fixed to not leak these attributes. This root issue, an AS4_PATH of ( 65490 65410 ) 8758 196621 associated with the prefix 193.5.68.0/23, is captured here:

    http://www.paste-it.net/public/p698b04/

    2) Some routers will drop the BGP session upon receipt of an AS4_PATH with leaked confederation attributes. This behavior seems to have been responsible for impact to DSL users connected to Juniper E-series/ERX platforms (running JUNOSe). One workaround for this issue is to configure neighbor x.x.x.x lenient on the sessions you don’t want to drop.

    3) There appears to be another unrelated issue under JUNOS 9.1R1 (and possibly other versions) wherein the router accepted the invalid AS4_PATH attribute, but then propagated an extremely corrupted version of the update, causing all neighbors who received this update to drop the BGP sessions from these JUNOS 9.1R1 routers. This is what was captured here:

    http://www.paste-it.net/public/o83f44b/

    A trifecta of failure. All in all, a bad night for Juniper. :)