For those of you not paying attention, a slew of new instabilities in the global routing system is occurring, again. These are presumably being tickled by another ugly AS4_PATH tunnel bug, in which someone [read: a broken implementation] erroneously includes AS_CONFED_* segments in an AS4_PATH attribute, an optional transitive BGP attribute that’s essentially ‘tunneled’ across autonomous systems whose BGP speakers don’t support 4-octet AS numbers.
The problem is that when it’s unencapsulated at the receiving end, by a BGP router that could be several networks away, those AS_CONFED_* segments aren’t supposed to be there. The receiving router may either reset the local session with the adjacent external BGP speaker from which the update was received, or propagate the update to an internal BGP peer, which will likely drop the session with the transmitting speaker. Neither response stops the problem at the source. The prefix causing all the fuss appears to be 18.104.22.168/23; a copy of the suspect update [courtesy of ras] is available here. Ras’s email provides some insight into what he’s seeing at the moment. It lives in the juniper-nsp archive, linked below, which is currently unavailable.
The error handling procedures in the relevant protocol specifications were rather vague in this area until recently. A couple of drafts have been submitted to the IETF Inter-Domain Routing (IDR) WG that attempt to address the specific case outlined above, as well as the more general problem of error handling for optional transitive attributes. One of the drafts is an update to the original BGP Support for Four-octet AS Number Space specification, RFC 4893, with more explicit guidance and an expanded error handling procedures section. Another draft, Error Handling for Optional Transitive BGP Attributes, attempts to be more prescriptive in general, and addresses a few specific issues in existing specifications as well.
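To make the failure mode a bit more concrete, here is a rough Python sketch of what a more forgiving receiver could do: walk the segments in an AS4_PATH value, notice any AS_CONFED_* segments, and discard just those instead of tearing down the session. The segment type codes are the standard ones from RFC 4271 and RFC 5065; everything else (the function names, the strip-and-continue policy, and the example AS numbers, which come from the documentation and private-use ranges rather than from the actual update) is an assumption for illustration, not the behavior of any particular implementation or the exact text of either draft.

```python
import struct

# Path segment type codes (RFC 4271 for the first two, RFC 5065 for the confed types).
AS_SET, AS_SEQUENCE, AS_CONFED_SEQUENCE, AS_CONFED_SET = 1, 2, 3, 4
CONFED_TYPES = {AS_CONFED_SEQUENCE, AS_CONFED_SET}


def parse_as4_path(value: bytes):
    """Split an AS4_PATH attribute value into (segment_type, [asn, ...]) tuples.

    Each segment is: 1-octet type, 1-octet AS count, then count 4-octet AS numbers.
    """
    segments = []
    offset = 0
    while offset < len(value):
        if offset + 2 > len(value):
            raise ValueError("truncated segment header")
        seg_type, count = struct.unpack_from("!BB", value, offset)
        offset += 2
        end = offset + 4 * count
        if end > len(value):
            raise ValueError("truncated segment body")
        asns = list(struct.unpack_from(f"!{count}I", value, offset))
        segments.append((seg_type, asns))
        offset = end
    return segments


def strip_confed_segments(segments):
    """Hypothetical lenient policy: drop AS_CONFED_* segments, keep the rest.

    Returns (clean_segments, dropped_count) so the caller can log the event.
    """
    clean = [seg for seg in segments if seg[0] not in CONFED_TYPES]
    return clean, len(segments) - len(clean)


def encode_as4_path(segments) -> bytes:
    """Re-encode segments into an AS4_PATH attribute value."""
    out = b""
    for seg_type, asns in segments:
        out += struct.pack(f"!BB{len(asns)}I", seg_type, len(asns), *asns)
    return out


# Illustrative input only: a confed segment that should never have been tunneled,
# followed by a legitimate AS_SEQUENCE.
raw = encode_as4_path([
    (AS_CONFED_SEQUENCE, [64512, 64513]),
    (AS_SEQUENCE, [65536, 64496]),
])
clean, dropped = strip_confed_segments(parse_as4_path(raw))
print(f"dropped {dropped} confed segment(s); remaining: {clean}")
```

The point of the sketch is the design choice, not the parsing: the offending data arrived from several networks away, so punishing the directly adjacent peer with a session reset does nothing about the source, whereas discarding the bad segments and logging the event keeps the session up.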
Some more information on the original problem from Rob Shakir and others on the IDR mailing list can be found here and in nested references. I first heard about this specific incident through an email from Richard A Steenbergen ‘ras’ on the Juniper NSP mailing list, about one hour after the incident began. Coincidentally, the web interface for the Juniper NSP mailing list seems to be having some reachability problems at the moment that may actually be related to this specific issue. Some earlier text on the previous incident, as well as on a related problem, is available in an earlier post on our blog here.
It is worth noting that the amount of instability resulting from this seems to be significant, but not catastrophic at this point, although that’s likely something that is very topologically dependent, and may very well not be the case for a few less fortunate folk. Finally, this incident is still evolving; it’s been going on for about 4 hours now. Let’s hope it’s squelched soon. If anything noteworthy emerges, I’ll be sure to post updates here.