Linus Torvalds accuses Intel of killing ECC RAM in consumer systems




Linus Torvalds isn’t happy with the way Intel has handled Error Correcting Code (ECC) memory support, and he blames the silicon giant for essentially killing the technology outside of servers. ECC memory is used to detect and correct single-bit errors in memory. It cannot correct multi-bit errors, but correcting even a single flipped bit can make a significant difference to system stability.
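To make the single-bit versus multi-bit distinction concrete, here is a toy sketch in Python. It uses a Hamming(7,4) code plus one overall parity bit, which is an assumption chosen for brevity: real ECC DIMMs protect 64 data bits with 8 check bits, and the function names `encode` and `decode` are purely illustrative.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit word: Hamming(7,4) plus overall parity."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 holds the overall parity bit
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    for bit in code[1:]:
        code[0] ^= bit
    return code


def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions of all set bits
    overall = 0
    for bit in code:
        overall ^= bit                  # 0 when the overall parity still holds
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        code[syndrome] ^= 1             # single flip: the syndrome names the bad bit
        status = "corrected"
    else:
        return None, "uncorrectable"    # two flips: detected, but cannot be repaired
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = encode(0b1011)

one_flip = list(word)
one_flip[5] ^= 1
print(decode(one_flip))    # (11, 'corrected')

two_flips = list(word)
two_flips[5] ^= 1
two_flips[6] ^= 1
print(decode(two_flips))   # (None, 'uncorrectable')
```

The second case is the one that matters for stability: a double-bit error is flagged instead of being silently handed back as good data.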

There was a time when you could buy ECC support on consumer chipsets, but Intel phased this capability out of non-Xeon platforms years ago. The 975X may have been the last mainstream Intel platform to support it, and that family launched 15 years ago. The Xeon 3450 chipset was compatible with some high-end processors in the Nehalem family, but it’s still a Xeon chipset – not a mainstream part.

As a result, support for ECC in consumer products, and the availability of ECC RAM for them, both fell off a cliff. Linus lays out his case in a rather lengthy post, citing Rowhammer’s persistence and the fact that single-bit errors never went away to declare Intel’s ECC policies “bad and misguided.” He also attacks the DRAM industry as a whole, writing:

The memory manufacturers claim it’s because of economics and lower power. And they are lying bastards – let me once again point to row-hammer about how those problems have existed for several generations already, but these f*ckers happily sold broken hardware to consumers and claimed it was an “attack”, when it always was “we’re cutting corners”.

Torvalds also refers to numerous kernel “oopses” that he believes are best explained by hardware errors. While objective data on this is hard to find, a 2009 Google report on memory errors provides evidence that he is right, although a 2009 paper obviously may have limited applicability to DDR4 RAM in 2020.

Image from Wikimedia Commons, by Kjerish. CC BY-SA 4.0

Google’s conclusion from 2009 was simple: “We found the incidence of memory errors and the range of error rates across different DIMMs (dual in-line memory modules) to be much higher than previously reported… Memory errors are not uncommon.” The team detected error rates that it describes as “orders of magnitude higher than previously reported.”

They conclude: “Error correction codes are essential in reducing the large number of memory errors to a manageable number of uncorrectable errors.”

AMD’s Current Support Is of Limited Value

On paper, AMD’s Ryzen family unofficially supports ECC (Threadripper has official ECC support). However, as Ian Cutress points out later in the thread, just because a motherboard claims ECC support doesn’t mean it is actually enabled. We don’t come across this situation very often, but processors and motherboards report their feature sets through registers, which applications like CPUID then check to determine and report which features a chip supports. An application that claims to check whether a given feature is supported (SSE, AVX, ECC, etc.) can only report what the CPU or motherboard says about itself via register flags. It can’t verify that the support actually exists unless it contains a functional test, such as a small benchmark that literally can’t run unless AVX is working.
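The gap between a reported feature and a working one is easy to see in a small Python sketch. It assumes the `flags` line format that Linux exposes in `/proc/cpuinfo` on x86; the function name `reported_flags` and the sample text are hypothetical, made up for illustration.

```python
def reported_flags(cpuinfo_text):
    """Collect the feature names listed on 'flags' lines of cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags


# Hypothetical sample; on a Linux machine the text would come from /proc/cpuinfo.
SAMPLE = """processor\t: 0
model name\t: Hypothetical CPU
flags\t\t: fpu sse sse2 avx
"""

flags = reported_flags(SAMPLE)
print("avx" in flags)    # True: the CPU *claims* AVX
print("avx2" in flags)   # False: no such claim was made
```

Nothing in this check executes an AVX instruction. It only repeats what the hardware says about itself, which is exactly the limitation described above.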

Since AMD’s support is unofficial, no one is standing over OEMs with a whip to make sure they implement the feature correctly, or testing to make sure that it actually works. Because it is possible to set the “Supports ECC” bit in a motherboard register without implementing functional ECC, there are motherboards that claim to support the standard, and appear to do so if you scan them with a utility, but don’t actually implement ECC at all. The only way to confirm that ECC works on an AMD Ryzen motherboard is to run a utility that forces an ECC error.
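On Linux, one way to see whether a forced error was actually caught is to compare the kernel’s EDAC error counters before and after the test. The Python sketch below assumes the EDAC driver for the platform is loaded and exposes `ce_count` (corrected) and `ue_count` (uncorrected) files under `/sys/devices/system/edac/mc`; the helper names `read_edac_counts` and `ecc_reacted` are illustrative, not part of any tool.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} read from EDAC sysfs counters."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue                    # skip controllers without readable counters
        counts[mc.name] = (ce, ue)
    return counts


def ecc_reacted(before, after):
    """True if any memory controller logged new errors between two snapshots."""
    return any(
        sum(after[name]) > sum(before.get(name, (0, 0)))
        for name in after
    )


before = read_edac_counts()
# ... run the utility that forces an ECC error here ...
after = read_edac_counts()
print(after, ecc_reacted(before, after))
```

An empty result means no memory controller is registered with EDAC. That can mean ECC is not active, but it can also simply mean the driver is not loaded, so it is a hint and not proof either way.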

As to whether we’ll see the feature make a return to Intel desktops or officially debut for Ryzen, it’s unclear. This would require buy-in from memory manufacturers, and it’s not clear that many people in the PC market would buy into it. Most people buy on price, and since you never know about PC failures that you don’t have, it’s hard to sell people the benefit. Then again, we’re going to see x86 processor makers facing much more difficult challenges from ARM over the next 2-5 years than we’ve ever seen before. It wouldn’t be surprising to see Intel and / or AMD “rediscovering” certain features, especially if those features allow them to claim increased stability over previous products.

The feature image shows registered DDR4-2133 DIMMs. Registered DIMMs often also support ECC, but unbuffered ECC RAM can also be found.
