Facebook explains the backbone shutdown behind its massive outage




The massive outage that took down Facebook, its associated services (Instagram, WhatsApp, Oculus, Messenger), its business platform, and the company’s own internal network all started with routine maintenance.

According to Santosh Janardhan, Facebook’s vice president of infrastructure, a command issued during maintenance inadvertently shut down the backbone that connects all of Facebook’s data centers around the world.

That in itself is bad enough, but as we explained before, the reason you couldn’t use Facebook was that the DNS and BGP routing information pointing to its servers suddenly disappeared. According to Janardhan, this was a secondary failure: Facebook’s DNS servers noticed they had lost their connection to the backbone and stopped publishing the BGP routes that help every computer on the Internet find those servers. The DNS servers were still working, but they were unreachable.
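To make that failure mode concrete, here is a minimal, hypothetical sketch of how a DNS edge node might withdraw its own BGP anycast advertisement when a backbone health check fails. This is not Facebook’s actual code; the probe endpoint, the prefix, and the announce/withdraw hooks are placeholders invented for illustration.

```python
# Hypothetical sketch of the failure mode described above -- not Facebook's
# actual code. A DNS edge node keeps checking that it can reach the internal
# backbone; if the check fails, it tells its local BGP daemon to withdraw the
# anycast prefix, taking itself off the Internet even though the DNS software
# itself is still healthy.

import socket
import time

ANYCAST_PREFIX = "192.0.2.0/24"                       # documentation prefix, not Facebook's
BACKBONE_PROBE = ("backbone.internal.example", 443)   # hypothetical internal endpoint

def backbone_reachable(timeout: float = 2.0) -> bool:
    """Crude reachability probe: can we open a TCP connection to the backbone?"""
    try:
        with socket.create_connection(BACKBONE_PROBE, timeout=timeout):
            return True
    except OSError:
        return False

def announce_routes(prefix: str) -> None:
    # In reality this would talk to a BGP daemon (BIRD, ExaBGP, etc.).
    print(f"announcing {prefix}")

def withdraw_routes(prefix: str) -> None:
    print(f"withdrawing {prefix}")

def health_loop() -> None:
    advertised = True
    while True:
        if backbone_reachable():
            if not advertised:
                announce_routes(ANYCAST_PREFIX)
                advertised = True
        else:
            # The node assumes it is the unhealthy one and pulls itself off the
            # Internet -- the behavior that made every Facebook DNS server
            # unreachable once the entire backbone went down at the same time.
            if advertised:
                withdraw_routes(ANYCAST_PREFIX)
                advertised = False
        time.sleep(5)

if __name__ == "__main__":
    health_loop()
```

The design is sensible when one node loses connectivity, because the rest of the anycast fleet keeps answering queries; it only becomes catastrophic when every node fails the same check at once.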

The lack of network connections and the loss of DNS cut engineers off from the servers they were trying to fix and disabled many of the tools they would normally use for repair and communication, as we heard yesterday.

The blog post notes that engineers encountered additional hurdles because of the physical and system security around this critical hardware. Once they “enabled secure access protocols” (apparently not a euphemism for “opening the server door with an angle grinder”), they were able to bring the backbone back online and slowly restore services by gradually increasing loads. This is part of the reason it took some people longer to regain access yesterday, since the power and compute demands of turning everything on at the same time could have caused more crashes.
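As a rough illustration of what “gradually increasing loads” can look like, here is a hedged sketch of a staged restoration loop. The service list, ramp steps, and admit_traffic() hook are invented for the example; Facebook’s real restoration tooling is not public.

```python
# Hypothetical sketch of staged restoration: ramp traffic back service by
# service instead of flipping everything on at once, so power draw and cache
# warm-up stay manageable.

import time

SERVICES = ["dns", "backbone-links", "frontend", "messenger", "instagram"]
RAMP_STEPS = [0.05, 0.15, 0.35, 0.65, 1.00]   # fraction of normal traffic admitted

def admit_traffic(service: str, fraction: float) -> None:
    """Placeholder for whatever load-balancer or feature-gate API would be used."""
    print(f"{service}: admitting {fraction:.0%} of normal traffic")

def staged_restore() -> None:
    for service in SERVICES:
        for fraction in RAMP_STEPS:
            admit_traffic(service, fraction)
            # Pause between steps so demand ramps up slowly across data centers
            # rather than spiking everywhere at once.
            time.sleep(60)

if __name__ == "__main__":
    staged_restore()
```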

So that’s it. No conspiracy theories, and no technicians taking axes to the facility’s security to bring Mark Zuckerberg’s baby back to life. Just a faulty command that an audit tool failed to catch, and for six hours, the services that connect billions of people were gone.


