Facebook attributes outage to error during routine maintenance


This file photo from February 19, 2014, shows icons of WhatsApp and Facebook apps on a smartphone in New York City. (AP Photo / Patrick Sison, File)

LONDON (AP) – The global outage that knocked Facebook and its other platforms offline for hours was caused by an error during routine maintenance, the company said.

Santosh Janardhan, Facebook’s vice president of infrastructure, said in a blog post that the outage of Facebook, Instagram and WhatsApp was caused “not by malicious activity, but an error of our own making.”

The problem arose while engineers were doing day-to-day work on Facebook’s global backbone: the computers, routers and software in its data centers around the world, along with the fiber-optic cables that connect them.

“During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally,” Janardhan said Tuesday.

Facebook’s systems are designed to catch such errors, but in this case, a bug in the audit tool prevented it from properly stopping the command, Janardhan said.

The change also triggered a second problem that made matters worse: Facebook’s servers became impossible to reach even though they were still up and running.

Engineers rushed on site to fix the problem, but it took time because of the extra layers of security, Janardhan said. Data centers are “hard to get into, and once you’re inside, the hardware and routers are designed to be difficult to modify even when you have physical access to them.”

Once connectivity was restored, services were gradually reestablished to avoid spikes in traffic that could cause further crashes.

A faulty maintenance update taking down Facebook’s backbone was an “unforeseen bug,” but the company likely could have avoided a scenario in which its servers were completely disconnected, making it impossible to access the tools needed to fix it, said Angelique Medina of Cisco Systems’ ThousandEyes, a company that monitors internet outages.

“The big question is why so many internal tools and systems could have a single point of failure,” said Medina. “Facebook would still have been down because of the network outage, but they could have resolved the outage sooner if they had internal access.”
