'Small Heart Attack' Incident Points to Problems Keeping Internet Infrastructure Secure
(jeremy_kirk) • June 26, 2019
Verizon's store on Wall Street in New York (Source: Verizon)
There tends to be a fair amount of irritation when the internet goes on the blink.
That certainly was the case Tuesday during an incident that the networking services company Cloudflare described in a blog post as "a small heart attack."
Unsurprisingly, the problem revolved around Border Gateway Protocol, or BGP, the delicate protocol that stitches together the thousands of autonomous systems, or in layman's terms, networks, that make up the internet. BGP is regarded as one of the most fragile but critical technologies underlying the internet.
BGP is designed to let networks announce the best paths to their resources. Those path updates, which are called "announcements," are then propagated by other networks.
But BGP is like a row of dominoes. If one network makes a mistake in announcing new routes, other networks down the line can pass on that mistake if the bad routes aren't filtered out. Errors that spread this way are referred to as route leaks.
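The domino effect can be sketched in a few lines of Python. This is a toy model, not how BGP is implemented: the network names and the `propagate` function are hypothetical, and each network simply re-announces a route to its neighbors unless it filters the route out.

```python
from collections import deque

def propagate(links, origin, filtering):
    """Return the set of networks that end up carrying a route from `origin`.

    links:     dict mapping a network to the neighbors it re-announces to
    filtering: set of networks that reject the (bad) route
    """
    reached = {origin}
    queue = deque([origin])
    while queue:
        current = queue.popleft()
        for neighbor in links.get(current, []):
            # A network that filters the route never accepts it,
            # so it never passes it further down the line.
            if neighbor in reached or neighbor in filtering:
                continue
            reached.add(neighbor)
            queue.append(neighbor)
    return reached

# Hypothetical topology: a small ISP, its customer, a large transit
# provider and two of the transit provider's peers.
links = {
    "isp": ["customer"],
    "customer": ["transit"],
    "transit": ["peer-a", "peer-b"],
}

unfiltered = propagate(links, "isp", filtering=set())
filtered = propagate(links, "isp", filtering={"transit"})
print(sorted(unfiltered))
print(sorted(filtered))
```

With no filtering, the bad route reaches every network in the toy topology. When the transit provider filters it, the mistake stays confined to the ISP and its customer.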
Cloudflare put the blame squarely on Verizon for not adequately filtering erroneous routes announced by an ISP, DQE Communications, in Pennsylvania. It pulled no punches, saying there was no good reason for Verizon's failure other than "sloppiness or laziness."
"The leak should have stopped at Verizon," writes Tom Strickx, a Cloudflare network software engineer, in the blog post. "However, against numerous best practices outlined below, Verizon's lack of filtering turned this into a major incident that affected many Internet services such as Amazon, Linode and Cloudflare."
Routes Gone Bad
DQE used a BGP optimizer, which allows for more specific BGP routes, Strickx writes. Routers prefer those more specific routes over more general ones that cover the same addresses. DQE announced the routes to one of its customers, Allegheny Technologies, a metals manufacturing company. Then, those routes went to Verizon.
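Why a more specific route wins can be shown with a minimal longest-prefix-match sketch. The prefixes below are documentation addresses and the next-hop labels are made up; they are assumptions for illustration, not the routes involved in the incident.

```python
import ipaddress

def best_route(routes, destination):
    """Return the (prefix, next_hop) pair with the longest matching prefix."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in routes if addr in net]
    if not matches:
        return None
    # The longest prefix length is the most specific match.
    return max(matches, key=lambda item: item[0].prefixlen)

# Hypothetical routing table holding only the legitimate /24.
routes = [
    (ipaddress.ip_network("198.51.100.0/24"), "legitimate-origin"),
]
before = best_route(routes, "198.51.100.42")

# A leaked, more specific /25 that covers the same address shows up.
routes.append((ipaddress.ip_network("198.51.100.0/25"), "leaked-path"))
after = best_route(routes, "198.51.100.42")

print(before[1], "->", after[1])
```

Once the /25 is in the table, traffic for the address follows the leaked path even though the legitimate /24 is still present.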
To be fair, the ultimate responsibility falls on DQE for announcing the wrong routes. Allegheny is somewhat to blame for pushing those routes on. But then Verizon - one of the largest transit providers in the world - propagated the routes. That's when it became messy.
Internet traffic destined for Cloudflare, Amazon Web Services and Google then went through DQE, Allegheny and Verizon. Subsequently, about 15 percent of Cloudflare's global traffic was affected during the most severe period.
A Cloudflare diagram shows how the route leak occurred.
Strickx writes that there is a no-cost way to avoid accepting bad routes. Verizon could have set a limit on the number of prefixes - the term for blocks of IP addresses - that its routers are allowed to accept per BGP session. If a session delivers more prefixes than the limit allows - which may be a sign of an error - a router can refuse the announcements, thus preventing a huge traffic choke point.
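A per-session prefix limit can be modeled roughly as follows. The class and its names are hypothetical. The sketch also assumes the session is shut down once the limit is exceeded, which is how many routers behave; the description above says only that the router can refuse the announcement.

```python
class BgpSession:
    """Toy model of one BGP session with a maximum-prefix limit."""

    def __init__(self, peer, max_prefixes):
        self.peer = peer
        self.max_prefixes = max_prefixes
        self.accepted = set()
        self.established = True

    def receive(self, prefix):
        """Accept a prefix, or shut the session down if it exceeds the limit."""
        if not self.established:
            return False
        is_new = prefix not in self.accepted
        if is_new and len(self.accepted) >= self.max_prefixes:
            # Assumption: exceeding the limit tears the session down and
            # discards everything learned from this peer.
            self.accepted.clear()
            self.established = False
            return False
        self.accepted.add(prefix)
        return True

# Hypothetical customer that is expected to announce only a few prefixes.
session = BgpSession("customer", max_prefixes=3)
announcements = [f"192.0.2.{i * 16}/28" for i in range(5)]
results = [session.receive(prefix) for prefix in announcements]
print(results, session.established)
```

The first three announcements are accepted. The fourth exceeds the limit, so the session goes down and nothing further from that peer is accepted.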
"Had Verizon had such a prefix limit in place, this would not have occurred," Strickx writes. "It is a best practice to have such limits in place. It doesn't cost a provider like Verizon anything to have such limits in place."
No Beef With Verizon
Cloudflare's sharp blog post, and the speed with which it was published after the incident, caught many by surprise.
"Regarding those really aggressive claims, I was a bit shocked by that as well," writes one person on Hacker News, which has a commentary thread on the kerfuffle. The observer speculated that Cloudflare might have "some pre-existing beef with Verizon and is using this as an opportune moment to dump on them."
Not so, says Cloudflare CTO John Graham-Cumming.
"No one had an axe to grind with Verizon," Graham-Cumming writes in a response on Hacker News. "We were working a complex problem affecting a good chunk of our traffic and customers. Everyone was calm and collected and thoughtful throughout."
Cloudflare's legal department signed off on the blog post, Graham-Cumming writes.
Verizon officials couldn't be immediately reached in Sydney on Wednesday for comment. But this isn't the first time I've contacted Verizon regarding irregularities with how it handles BGP.
Last year, I reported that Verizon carried erroneous routing announcements that allowed traffic for Australian defense websites to go through China Telecom. The situation carried on for some 30 months, and no clear answer emerged as to why Verizon allowed it to carry on for so long. There are security risks in routing traffic through nations such as China (see: Did China Spy on Australian Defense Websites?).
How to Mitigate BGP Risks
There have been plenty of warnings by experts over the shortcomings in BGP, which was first specified in 1989. But as Strickx points out, there are ways to mitigate the risks, such as tapping into the Internet Routing Registry, which maintains databases of prefixes that networks can consult to avoid accepting wrong announcements from other providers.
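An IRR-based filter amounts to an allow list built from registered route entries. In the sketch below, the registry contents, the AS numbers and the function names are all hypothetical. A real deployment would pull route objects from an IRR database rather than hard-code them.

```python
import ipaddress

# Hypothetical registry entries: (prefix, origin AS number).
IRR_ROUTE_OBJECTS = {
    ("192.0.2.0/24", 64500),
    ("198.51.100.0/24", 64500),
    ("203.0.113.0/24", 64501),
}

def build_filter(route_objects, customer_asns):
    """Build an allow list of (network, origin) pairs for one customer."""
    return {
        (ipaddress.ip_network(prefix), asn)
        for prefix, asn in route_objects
        if asn in customer_asns
    }

def accept(allow_list, prefix, origin_asn):
    """Accept an announcement only if it exactly matches a registered entry."""
    return (ipaddress.ip_network(prefix), origin_asn) in allow_list

customer_filter = build_filter(IRR_ROUTE_OBJECTS, customer_asns={64500})

print(accept(customer_filter, "192.0.2.0/24", 64500))    # registered
print(accept(customer_filter, "192.0.2.0/25", 64500))    # unregistered more specific
print(accept(customer_filter, "203.0.113.0/24", 64500))  # belongs to another network
```

Under this exact-match policy, an unregistered more specific route is rejected, as is a prefix registered to a different network.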
Cloudflare's diagram explaining how IRR and RPKI would have stopped the erroneous route leak.
Also, there's Resource Public Key Infrastructure, which uses a system of digital certificates to verify that a set of IP addresses belongs to a network provider and that the provider is allowed to make related BGP announcements.
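The check that RPKI enables can be sketched as route origin validation, loosely following the valid / invalid / not-found outcomes described in RFC 6811. The authorization record below is hypothetical, and the sketch skips the certificate verification that produces such records in practice.

```python
import ipaddress
from typing import NamedTuple, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]

class Roa(NamedTuple):
    """A simplified route origin authorization."""
    prefix: Network
    max_length: int  # longest prefix length the holder allows
    asn: int         # AS number authorized to originate the prefix

def validate(roas, prefix, origin_asn):
    """Classify an announcement as 'valid', 'invalid' or 'not-found'."""
    route = ipaddress.ip_network(prefix)
    covered = False
    for roa in roas:
        same_family = route.version == roa.prefix.version
        if same_family and route.subnet_of(roa.prefix):
            covered = True
            if route.prefixlen <= roa.max_length and origin_asn == roa.asn:
                return "valid"
    # Covered by an authorization but matching none of them: invalid.
    return "invalid" if covered else "not-found"

# Hypothetical authorization: AS 64500 may originate exactly this /24.
roas = [Roa(ipaddress.ip_network("198.51.100.0/24"), 24, 64500)]

print(validate(roas, "198.51.100.0/24", 64500))
print(validate(roas, "198.51.100.0/25", 64500))
print(validate(roas, "198.51.100.0/24", 64501))
print(validate(roas, "203.0.113.0/24", 64500))
```

In this model a more specific route than the authorization permits comes back invalid, so a network that drops invalid routes would not accept it.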
While many networks use IRR and RPKI, an internet-wide upgrade of all networks remains an elusive goal. One consolation is that BGP mistakes usually don't go unnoticed for long. But be prepared for more small heart attacks along the way.