What brought down Facebook: Major global outage continues




The old saying about network troubleshooting is that when something goes wrong, “it’s DNS.” This time, the Domain Name System (DNS) failure appears to be a symptom of the root cause of Facebook’s overall outage. The real cause is that there are no functional Border Gateway Protocol (BGP) routes into Facebook’s sites.

BGP is the standardized exterior gateway protocol used to exchange routing and reachability information between the autonomous systems (AS) that make up the internet. Most people, indeed most network administrators, never need to deal with BGP.

Many people noticed that Facebook was no longer listed in DNS. Indeed, there were joke posts offering to sell you the Facebook.com domain.
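For readers who want to see what “no longer listed in DNS” looks like from the outside, here is a minimal sketch using only Python’s standard library. The helper name `can_resolve` is an illustrative assumption, not a tool mentioned in any of the reports, and the check only tells you whether your own resolver returned an answer, not why it failed.

```python
import socket


def can_resolve(hostname):
    """Return True if the local resolver returns at least one address."""
    try:
        socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        # Raised when the name cannot be resolved (for example NXDOMAIN,
        # a SERVFAIL from the resolver, or no reachable name server).
        return False
    return True


# The .invalid top-level domain is reserved and should never resolve.
print(can_resolve("nonexistent.invalid"))
```

During an outage like this one, the same call with the affected domain would be expected to return `False` once resolvers can no longer reach the authoritative name servers and cached answers have expired.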

Cloudflare VP Dane Knecht was the first to report the underlying BGP problem. That meant, as Kevin Beaumont, former head of Microsoft’s security operations center, tweeted: “By not having BGP advertisements for your DNS name servers, DNS collapses = no one can find you on the internet. Ditto with WhatsApp for that matter. Facebook has essentially moved away from its own platform.”

Whoops.
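The dependency Beaumont describes, DNS name servers that are only reachable while their prefixes are advertised over BGP, can be sketched as a toy model. Everything here is an assumption made for illustration: the `RoutingTable` class, the `resolve` helper, the documentation-range prefixes, and the private-use AS number are hypothetical and do not reflect Facebook’s real network or any real BGP implementation.

```python
import ipaddress


class RoutingTable:
    """Toy stand-in for the routes the rest of the internet has learned."""

    def __init__(self):
        self._routes = {}  # ip_network -> label of the originating AS

    def advertise(self, prefix, origin):
        self._routes[ipaddress.ip_network(prefix)] = origin

    def withdraw(self, prefix):
        self._routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        """Return the origin of the longest matching prefix, or None."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self._routes if addr in net]
        if not matches:
            return None
        best = max(matches, key=lambda net: net.prefixlen)
        return self._routes[best]


def resolve(name, table, name_servers, zone):
    """Answer from the zone only if some name server is routable."""
    for ns_ip in name_servers:
        if table.lookup(ns_ip) is not None:
            return zone.get(name)
    raise LookupError(f"no route to any name server for {name}")


# Hypothetical setup using documentation address ranges.
table = RoutingTable()
table.advertise("192.0.2.0/24", "AS64500")     # prefix holding the name server
table.advertise("198.51.100.0/24", "AS64500")  # prefix holding the web servers
name_servers = ["192.0.2.53"]
zone = {"example.com": "198.51.100.10"}

before = resolve("example.com", table, name_servers, zone)

# Withdraw the route covering the name server, as a bad config change might.
table.withdraw("192.0.2.0/24")
try:
    after = resolve("example.com", table, name_servers, zone)
except LookupError as exc:
    after = str(exc)

print(before)
print(after)
```

In this model the web servers’ prefix is still advertised after the withdrawal, yet the name can no longer be resolved, which is why a routing failure can look like a DNS failure to everyone on the outside.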

As annoying as this is for you, it may be even more annoying for Facebook employees. There are reports that Facebook employees cannot enter their buildings because their “smart” badges and doors were also disabled by this network failure. If this is true, the folks at Facebook literally can’t get into the building to fix things.

Meanwhile, Reddit user u/ramenporn, who claimed to be a Facebook employee working to bring the social network back from the dead, reported, before deleting his account and messages, that “DNS services for FB have been affected and this is probably a symptom of the actual problem, and that is that BGP peering with Facebook peering routers has gone down, most likely due to a configuration change that went into effect shortly before the outages (started at around 1540 UTC).”

He continued, “There are now people trying to access peering routers to implement fixes, but the people with physical access are distinct from those who know how to authenticate to systems and devices, and from the people who actually know what to do, so there is now a logistical challenge to unify all this knowledge. This is also partly due to the reduction in staff in data centers due to the pandemic measures.”

Ramenporn also said it was not an attack, but an erroneous configuration change made through a web interface. What really stinks, and why Facebook is still down hours later, is that since BGP and DNS are down, the “connection to the outside world is cut, remote access to these tools no longer exists, so the emergency procedure is to get physical access to the peering routers and do all the configuration locally.” Of course, the technicians on site don’t know how to do it and the senior network administrators are not on site. It is, in short, a big mess.

As a former network administrator who worked on the internet at this level, I expect Facebook to be down for hours more. I suspect this will end up being Facebook’s longest and most serious outage to date before it’s fixed.
