Multi-thread performance study on Zen 3 and AMD Ryzen 5000




One of the stories around the early generations of AMD's Zen processors was the effect of simultaneous multi-threading (SMT) on performance. Running with this mode enabled, as is the default in most situations, users saw significant performance gains in workloads that could take advantage of it. The reasons for those gains rest on two competing questions: first, is the core designed in such a way that it is underutilized by a single thread, or second, has an efficient SMT strategy been built in order to increase performance? In this review, we take a look at AMD's latest Zen 3 architecture to observe the benefits of SMT.

What is simultaneous multi-threading (SMT)?

We often think of each processor core as being able to process one serial instruction stream for whatever program is running. Simultaneous multi-threading, or SMT, allows a processor to execute two simultaneous instruction streams on the same core, sharing resources, so that idle time in one set of instructions can be filled by a secondary set that takes advantage of the underutilization. The two limiting factors in most computing models are compute throughput and memory latency, and SMT is designed to interleave sets of instructions so as to optimize compute throughput while masking memory latency.
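As a rough illustration of the latency-masking idea, here is a toy analytical model of our own (it does not describe any real core): if a single thread keeps the execution resources busy only a fraction `u` of the time because of stalls, a second thread can fill some of the remaining idle slots.

```python
# Toy utilization model for SMT2 latency hiding (illustrative only).
# u: fraction of cycles a single thread keeps the core's execution
#    resources busy; the rest is lost to stalls such as memory latency.

def smt2_throughput(u: float) -> float:
    """Relative throughput of two threads sharing one core, assuming
    their busy cycles interleave perfectly and the core saturates at
    1.0 (100% utilization)."""
    return min(2 * u, 1.0)

def smt2_speedup(u: float) -> float:
    """Speedup over a single thread running at utilization u."""
    return smt2_throughput(u) / u

# A stall-heavy thread (40% busy) nearly doubles with SMT...
print(smt2_speedup(0.4))   # 2.0
# ...while a thread that already saturates the core gains little.
print(smt2_speedup(0.9))
```

The model captures the dichotomy discussed below: large SMT gains imply the single thread was leaving a lot of the core idle.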


An old slide from Intel, which has its own marketing term for SMT: Hyper-Threading

When SMT is enabled, depending on the processor, it will allow two, four, or eight threads to run on that core (we have seen esoteric compute-in-memory designs with 24 threads per core). Instructions from either thread are reordered so that they can be processed in the same cycle and keep utilization of the core's resources high. Because multiple threads are used, this is known as extracting thread-level parallelism (TLP) from a workload, whereas a single thread whose instructions can run simultaneously exhibits instruction-level parallelism (ILP).

Is SMT a good thing?

It depends on who you ask.

SMT2 (two threads per core) involves building core structures sufficient to hold and manage two instruction streams, as well as managing how those structures share resources. For example, if a particular buffer in the core design is expected to hold up to 64 instructions in a queue, but on average a single thread only fills 40 of them, then the buffer is underutilized, and an SMT design lets a second thread feed the buffer to keep it closer to full. The buffer might be increased to 96 entries to accommodate two threads, ensuring that if both instruction streams run at that "average", both have sufficient headroom. That supports two threads for only 1.5 times the buffer size: if everything else scales similarly, it means double the potential performance for less than double the die area. But in single-thread mode, where that 96-entry buffer sits barely 40% full yet has to be powered in its entirety, power can be wasted.
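The buffer arithmetic above can be checked directly. The 64/40/96 figures are the article's hypothetical example, not real Zen 3 structure sizes:

```python
# Hypothetical buffer-sizing example from the text (not real Zen 3 numbers).
single_thread_capacity = 64   # entries a one-thread design provides
average_occupancy      = 40   # entries a typical thread actually uses
smt2_capacity          = 96   # entries the SMT2 design provides

# Growth in buffer area for a second thread: 1.5x, not 2x.
area_growth = smt2_capacity / single_thread_capacity
print(area_growth)            # 1.5

# Both threads at "average" load still fit, with headroom to spare.
assert 2 * average_occupancy <= smt2_capacity

# With one thread, most of the enlarged buffer sits idle.
st_occupancy = average_occupancy / smt2_capacity
print(f"{st_occupancy:.0%}")  # 42%
```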

However, if a core design benefits from SMT, then perhaps the core was not designed optimally for single-thread performance in the first place. If enabling SMT gave a user exactly double the performance and perfect scaling across the board, as if there were two cores, perhaps there is a direct issue with how the core is designed, from threads to buffers to the cache hierarchy. Users are known to complain that they only get a 5-10% performance boost with SMT enabled, claiming it does not work properly; it could simply be that the core is well designed for single-threaded work. Likewise, claiming that a +70% performance gain means SMT is performing well might instead be a signal of an unbalanced core design that wastes power.

This is the dichotomy of simultaneous multi-threading. If it works well, a user gets extra performance. But if it works too well, it may indicate a core not well suited to a particular workload. The answer to the question "Is SMT a good thing?" is more complicated than it first appears.

We can divide processor designs into those that use SMT:

  • High-performance x86 from Intel
  • High-performance x86 from AMD
  • High-performance POWER / z from IBM
  • Some high-performance Arm-based designs
  • High-performance compute-in-memory designs
  • High-performance AI hardware

Compared to those that do not:

  • High-efficiency x86 from Intel
  • All Arm smartphone-class processors
  • Successful high-performance Arm-based designs
  • Highly focused HPC workloads on x86 with compute bottlenecks

(Note that Intel calls its SMT implementation "Hyper-Threading", a marketing term specific to Intel.)

At this point, we have only discussed SMT with two threads per core, known as SMT2. Some of the more esoteric hardware designs go beyond two threads per core, and use up to eight. You will see this styled in documentation as SMT8, compared to SMT2 or SMT4. This is how IBM approaches some of its designs. Some compute-in-memory applications go as far as SMT24!

There is a clear trend between SMT-enabled systems and systems without SMT, and it appears to be a hallmark of high-performance designs. The only exception to this is Apple's recent M1 processor and its Firestorm cores.

It should be noted that on systems supporting SMT, it can be disabled, forcing one thread per core, to run in what is effectively SMT1 mode. This has some notable advantages:

It allows each thread access to a full core of resources. In certain workload situations, having two threads on the same core means sharing resources and incurring additional unintended latency, which can matter for latency-critical workloads where deterministic (consistent) performance is required. It also reduces the number of threads competing for L3 capacity, if that is a limiting factor. Any software that has to probe all the other threads for data has fewer threads to reach: for a 16-core processor like the 5950X, this means reaching out to only 15 other threads rather than 31, reducing potential bottlenecks across the core-to-core interconnect.
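On Linux, the kernel exposes which logical CPUs share a physical core, so software can count SMT siblings and cross-core peers for itself. A minimal sketch, assuming the standard Linux sysfs interface (the parsing helper and `other_threads` model are our own, not part of any library):

```python
# Sketch: count how many *other* hardware threads a given thread can
# reach with and without SMT, plus a parser for Linux's sysfs CPU lists.
from pathlib import Path

def parse_cpu_list(text: str) -> list[int]:
    """Parse a sysfs CPU list like '0,16' or '0-3,8-11' into ints."""
    cpus = []
    for part in text.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus

def other_threads(total_threads: int, smt_on: bool, smt_width: int = 2) -> int:
    """Threads a single thread must probe: 31 on a 32-thread 5950X
    with SMT on, 15 with SMT off (one thread per core)."""
    return (total_threads if smt_on else total_threads // smt_width) - 1

print(other_threads(32, smt_on=True))    # 31
print(other_threads(32, smt_on=False))   # 15

# Real topology, where available (e.g. '0,16' on a 5950X with SMT on):
p = Path("/sys/devices/system/cpu/cpu0/topology/thread_siblings_list")
if p.exists():
    print(parse_cpu_list(p.read_text()))
```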

The other aspect is power. With only one thread on a core, there is no second thread to use the resources when they are underutilized; when there is a delay caused by fetching something from main memory, that core's power will be lower, allowing other cores to ramp up in frequency. It is a bit of a double-edged sword: with SMT off, the core may still sit at high voltage while waiting for data. SMT can therefore help improve performance per watt, assuming that enabling it does not cause contention for resources and potentially longer waits for data.

Critical enterprise workloads that require deterministic performance, and some HPC code that requires large amounts of memory per thread, often disable SMT on their deployed systems. Consumer workloads are rarely as critical (at least in terms of scale and cost), so the topic is not often covered in detail.

Most modern processors, when in SMT-enabled mode but executing only a single instruction stream, will operate as if in SMT-off mode and give that stream full access to resources. Some software takes advantage of this, spawning a single thread for each physical core in the system. Because core structures can be dynamically partitioned (resources adjust per thread while threads are in flight) or statically partitioned (fixed before a workload begins), situations where the two threads on one core create their own bottleneck benefit from having only one thread active per core. Knowing how a workload uses a core can help when designing software that spans multiple cores.
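The "one software thread per physical core" strategy can be sketched by grouping logical CPUs by core ID and keeping one from each group. The mapping below is a made-up 4-core SMT2 layout for illustration; real maps come from OS topology interfaces such as Linux's `/sys/devices/system/cpu/cpu*/topology/core_id`:

```python
# Sketch: pick one logical CPU per physical core, as software that
# spawns "one thread per physical core" must do. The layout below is
# a hypothetical 4-core SMT2 example, not a real Zen 3 topology dump.

def one_per_core(core_id_of: dict[int, int]) -> list[int]:
    """Given a logical-CPU -> physical-core-ID map, return the lowest
    logical CPU of each core, sorted."""
    first = {}
    for cpu in sorted(core_id_of):
        first.setdefault(core_id_of[cpu], cpu)
    return sorted(first.values())

# Logical CPUs 0-3 are the first thread of cores 0-3;
# logical CPUs 4-7 are their SMT siblings.
layout = {0: 0, 1: 1, 2: 2, 3: 3, 4: 0, 5: 1, 6: 2, 7: 3}
print(one_per_core(layout))   # [0, 1, 2, 3]

# A worker pool could then be pinned with os.sched_setaffinity (Linux)
# so each worker has a full core's resources to itself.
```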

Here is an example of a Zen 3 core, showing all the structures. One point of progress with each new hardware generation is to reduce the number of statically partitioned structures inside a core, as dynamic structures usually offer the best flexibility and peak performance. In Zen 3's case, only three structures remain statically partitioned: the store queue, the retire queue, and the micro-op queue. This is the same as Zen 2.
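The difference between static and dynamic partitioning can be shown with a toy queue model (illustrative only; the entry counts are made up and do not reflect real Zen 3 structure sizes):

```python
# Toy model: how many entries each of two threads can use in a shared
# 64-entry queue under static vs dynamic partitioning. Numbers are
# illustrative, not real Zen 3 structure sizes.
CAPACITY = 64

def static_partition(demand_a: int, demand_b: int) -> tuple[int, int]:
    """Each thread owns a fixed half, even if the other half is idle."""
    half = CAPACITY // 2
    return min(demand_a, half), min(demand_b, half)

def dynamic_partition(demand_a: int, demand_b: int) -> tuple[int, int]:
    """Entries are granted on demand up to total capacity; a busy
    thread can absorb capacity an idle thread is not using. (Thread A
    is arbitrarily served first in this sketch.)"""
    a = min(demand_a, CAPACITY)
    b = min(demand_b, CAPACITY - a)
    return a, b

# One busy thread, one mostly idle thread:
print(static_partition(50, 4))    # (32, 4) -- 28 entries go unused
print(dynamic_partition(50, 4))   # (50, 4) -- busy thread takes the slack
```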

SMT on AMD Zen 3 and Ryzen 5000

Just like AMD's previous Zen processors, the Ryzen 5000 series, which uses Zen 3 cores, also has an SMT2 design. By default this is enabled in every consumer BIOS, though users can choose to disable it through firmware options.

For this article, we used our AMD Ryzen 9 5950X, a high-performance 16-core Zen 3 processor, in both SMT-off and SMT-on modes, running our test suite and some industry-standard benchmarks. The goal of these tests is to answer the following questions:

  1. Is there a single-thread benefit to disabling SMT?
  2. How much performance is gained by enabling SMT?
  3. Does performance per watt change when SMT is enabled?
  4. Does enabling SMT result in higher workload latency? *

* more important for enterprise / database / AI workloads

The best argument for enabling SMT would be a No-Lots-Yes-No result. Conversely, the best argument against SMT would be Yes-No-No-Yes. But since the core structures were built with SMT in mind, the answers are rarely so clear-cut.

Test system

For our test suite, having obtained new 32 GB DDR4-3200 memory modules for Ryzen testing, we re-ran our standard suite on the Ryzen 9 5950X with SMT on and SMT off. Per our usual testing methodology, we test memory at the official JEDEC specifications for each processor.

Test setup

  • CPU: AMD Ryzen 9 5950X (AM4)
  • Motherboard: MSI MEG X570 Godlike, BIOS 1.B3T13, AGESA 1100
  • Cooling: Noctua NH-U12S
  • Memory: ADATA 4×32 GB DDR4-3200
  • GPUs: Sapphire RX 460 2 GB (CPU testing), NVIDIA RTX 2080 Ti
  • PSU: OCZ 1250W Gold
  • SSD: Crucial MX500 2 TB
  • OS: Windows 10 x64 1909, Spectre and Meltdown patched
  • VRM cooling: Silverstone SST-FHP141-VF 173 CFM fans

Thank you also to the companies that donated hardware for our test systems.
