
How exactly does SMT work?


Firestrider

At the core level, what is happening? How can you have multiple threads on a single core? How does this bring more performance if the number of execution units doesn't change?
 
It doesn't really help much. Intel way overhyped 'hyperthreading' as a marketing ploy. It's really nothing more than time-sharing on a single CPU. A CPU can only process x executions per unit time; dividing that processing power among multiple threads doesn't change the execution rate. However, many applications spend a lot of time just waiting for input from somewhere else, so if another thread that does have work to do can get attention in the meantime, that helps a little. Much of the thread-priority decision making is done by the O/S, though, so there has to be quite a bit of cooperation between the O/S and the CPU for something like Hyperthreading to do any good.

CPUs, for the longest time, have had on-board cache and predictive execution, which is where most of the benefit comes from. CPUs already try to guess which instructions are going to be needed next and load them into the pipeline in the hope that those instructions will, indeed, be the next ones executed. Otherwise, if the CPU guesses wrong, a penalty is incurred as the instruction pipeline has to be flushed and reloaded with the correct instructions.
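To make that misprediction penalty concrete, here's a minimal sketch (my own illustration, not from anyone in this thread): the same loop runs over the same data twice, and sorting the data first makes the branch predictable, so the pipeline stops getting flushed and the second pass is noticeably faster.

Code:
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

// Sum only the values >= 128. With random data this branch is unpredictable,
// so the CPU guesses wrong often and pays the flush/refill penalty each time.
static long long sum_large(const std::vector<int>& v) {
    long long total = 0;
    for (int x : v)
        if (x >= 128)
            total += x;
    return total;
}

int main() {
    std::vector<int> data(1 << 24);
    std::mt19937 rng(42);
    for (int& x : data) x = rng() % 256;

    auto time_it = [&](const char* label) {
        auto t0 = std::chrono::steady_clock::now();
        long long s = sum_large(data);
        auto t1 = std::chrono::steady_clock::now();
        std::printf("%s sum=%lld  %lld ms\n", label, s,
            (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count());
    };

    time_it("unsorted:");                       // branch mispredicts constantly
    std::sort(data.begin(), data.end());
    time_it("sorted:  ");                       // same work, predictable branch
}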
 
Then how is there a huge improvement in performance between HT and non-HT in highly threaded applications when the processor is fully loaded?

[attached chart: cinebench.gif, Cinebench scores for several CPUs, including the Core i7 965 Extreme with HT on and HT off]

Is the CPU really stalled ~17% of the time even in this application?

What I'm guessing is that executing instructions is faster than fetching and decoding them, and sometimes the instruction pipeline gets flushed, like you said, on a branch misprediction. So is SMT a method of keeping the execution units fed?
 
This chart compares single to multi-threaded execution on multiple cores. There is no comparison of the same application with hyperthreading turned on or turned off.

Your chart is mainly showing the benefits of multiple cores. When a multi-threaded application is allowed a separate CPU for each thread, you would expect that sort of performance enhancement.
 
Like mbigna said, there are gains from branch mispredictions (and the pipeline stalls that result). I don't think I/O is it. When a program calls a blocking I/O function (a function that doesn't return until data arrives), the OS will switch it out of context anyway. I think the biggest gain comes from cache misses, since the CPU can do other things while a thread needs some data from memory (which is comparatively very slow). A context switch is not practical in that case because it is quite expensive (saving the registers and the stack, then restoring them afterwards), but by duplicating certain parts of the CPU (the registers, including the instruction pointer and stack pointer), it can accept 2 (or more) threads at a time from the OS, and when one thread is waiting on a RAM access, the CPU can work on the other thread (this switch is nearly free since the registers are duplicated).

As for real world differences, I guess it will benefit programs that do a lot of random accesses (no spatial or temporal locality), such as programs that use big hash tables, or long linked lists with elements allocated from the heap.
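Here's a rough sketch of the kind of workload being described (my own toy code; the names and sizes are made up): every step is a dependent load with no locality, so a single thread spends most of its time waiting on memory. Run two of these on the two logical CPUs of one physical core and SMT can overlap one thread's memory waits with the other thread's work. On a real test you'd pin the two threads to sibling logical CPUs and compare against running the two chases back to back on one thread.

Code:
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <random>
#include <thread>
#include <vector>

// Build a random cycle through a big array so every load depends on the previous one.
static std::vector<std::size_t> make_chain(std::size_t n, unsigned seed) {
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), 0);
    std::shuffle(order.begin(), order.end(), std::mt19937(seed));
    std::vector<std::size_t> next(n);
    for (std::size_t i = 0; i + 1 < n; ++i) next[order[i]] = order[i + 1];
    next[order[n - 1]] = order[0];
    return next;
}

// Pointer chasing: no spatial or temporal locality, so nearly every step is a cache miss.
static std::size_t chase(const std::vector<std::size_t>& next, std::size_t steps) {
    std::size_t p = 0;
    for (std::size_t i = 0; i < steps; ++i) p = next[p];
    return p;
}

int main() {
    const std::size_t n = 1 << 24, steps = 1 << 26;
    auto a = make_chain(n, 1);
    auto b = make_chain(n, 2);

    // Two memory-latency-bound chases. Pinned to the two logical CPUs of one physical
    // core (affinity calls are OS-specific and omitted here), SMT lets one thread use
    // the execution units while the other is stalled on RAM.
    std::size_t ra = 0, rb = 0;
    std::thread ta([&] { ra = chase(a, steps); });
    std::thread tb([&] { rb = chase(b, steps); });
    ta.join();
    tb.join();
    std::printf("%zu %zu\n", ra, rb);
}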
 
HT loads up 1 core and its shared L1 and L2 with one or more programs. Depending on the compiler, 1 program may take advantage of it. All modern processors have some sort of queuing and out-of-order execution, but HT takes it a little further by doubling up on some parts of the execution unit. Programs like Cinebench duplicate parts of the program and thread out to several cores. That allows simultaneous multi-threading across the number of actual cores. HT is not true SMP in the multiprocessing sense, but it is SMT, since the threads can be dispatched onto the same core.

Firestrider, I would not say a huge improvement, but it is a nice improvement; 24K+ is what I would call a huge improvement. You doubled the pipeline feed for four cores but only got a 17% boost. Doubling cores yields about a 75 to 90% boost depending on the CPU and programs, so I would expect 25 to 35% from HT. In my honest opinion, regardless of who offers HT on their CPU, I would not get it for desktop gaming. I did like it on my old P4, when it was a good boost for software developers, but today multi-cores do so much more. It did nothing back in the single-core days for games, and it appears to slow down single-threaded apps today. If you run CAD or rendering tools then you should see some benefit from it. The cost of having it may be another kick if it raises the price by more than 17%.

Not to start a pissing contest, but the Phenom IIs are missing from that sampling. For those interested, Ph-2s score between the Q93xx and the i920, or better.
http://www.ocforums.com/showpost.php?p=5934305&postcount=3
 
Look at Core i7 965 Extreme. There is a graph for it with HT on and HT off.

Ah, you are correct. I missed that one.

Quite a lot depends on how well a program is written, as well as on how good the compiler is at producing executable code that runs well in multiple threads and/or on multiple cores.

A benchmarking program like Cinebench isn't necessarily the best means of estimating real-world performance: we already know that certain compilers are tuned to maximize benchmark scores, and that certain benchmark programs are written to favor one processor (or feature--like hyperthreading) over another.

That being said--a rendering application (like Cinebench) can easily distribute processing among multiple threads by dividing an image (or multiple frames for video) into sections and letting each thread work on its own section. Bottlenecks occur at the boundaries of these sections since data from both sides of a boundary need to be available at the same time so that interpolations between them can be calculated. One thread can be stalled if it is waiting for the other thread in an adjacent section to finish its boundary calculations.

Consider an image being rendered with two threads on a dual core processor. The image is divided in two and each thread gets to run on its own processor (ignore O/S overhead for this example). A 'smart' rendering application will divide the image along the shortest dimension (a landscape image should be divided into left and right halves, since the height is shorter than the width) to minimize the boundary. Unless the image is the same on both sides, one section will likely be completed before the other. [Whether the application performs boundary calculations pixel by pixel or waits to do all the boundary calculations after the rest of the section is completed--while not irrelevant--still leaves one of the threads periodically waiting on the other.] These sorts of issues help explain why (on top of O/S overhead) adding cores to a CPU won't scale performance linearly with the number of cores (a dual core won't perform twice as fast as a single core, and a quad core won't perform four times as fast as a single or twice as fast as a dual core--it will be something less).
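A toy sketch of that two-thread split (nothing to do with Cinebench's actual code; all the names here are mine): the landscape image is cut into a left and a right half, one thread per half, and the join at the end is exactly where the faster thread ends up idling while it waits for the slower one.

Code:
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

struct Image {
    int width, height;
    std::vector<float> pixels;
};

// Stand-in for real per-pixel rendering work.
static float shade(int x, int y) { return std::sin(x * 0.01f) * std::cos(y * 0.01f); }

// Render the columns [col_begin, col_end) of the image.
static void render_columns(Image& img, int col_begin, int col_end) {
    for (int y = 0; y < img.height; ++y)
        for (int x = col_begin; x < col_end; ++x)
            img.pixels[static_cast<std::size_t>(y) * img.width + x] = shade(x, y);
}

int main() {
    Image img{1920, 1080, std::vector<float>(1920 * 1080)};

    // Split along the shorter dimension: left and right halves, one thread each.
    // If the right half is more complex, the left thread finishes first and just
    // sits idle until the join -- the imbalance described above.
    int mid = img.width / 2;
    std::thread left([&] { render_columns(img, 0, mid); });
    std::thread right([&] { render_columns(img, mid, img.width); });
    left.join();
    right.join();
}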

Now, let's add Hyperthreading to the example. What Hyperthreading does is 'fake' the O/S and application into thinking there are more cores than there actually are. For the sake of this example we'll assume that four cores will be simulated on two actual cores. The application will divide the image into four sections--one for each of the four threads that it will spawn. The image will likely be divided by one horizontal line and one vertical line. [The rendering program might actually divide the image into four horizontal strips instead of four quadrants. It might even divide the image into irregular areas since certain parts might be more complex than others. The point, however, is that the number/length of boundaries has, at the very least, doubled.]

So, what then, does Hyperthreading buy you? Well, if one of the threads is stalled at a boundary (and assuming the application is written to take advantage of this), that thread can signal that it is waiting and another thread that is not being executed can be swapped in so that it might continue.

What are the drawbacks? Well, first of all, the number and length of the boundaries have at least doubled, so the number of boundary calculations has increased--even though the actual number of cores has not. The increase in boundaries also increases the chance of a thread stalling. And each time a thread stalls, there is a cost in terms of signaling and swapping one thread out for another. In fact, there will be regular thread swapping even if no thread is stalled (this is what 'time sharing' is, after all).

Furthermore, it might very well happen that the thread being swapped in for a stalled thread is itself already stalled, because it is waiting for data from a fourth thread that hasn't completed its calculations because it has been swapped out for an inordinate amount of time. Consider the situation where thread 1 is stalled at a boundary because it is waiting on thread 2 to work its way to the boundary of sections 1 and 2. But thread 2 is stalled at the boundary between sections 2 and 3 because thread 3 hasn't worked its way to that boundary. Same for threads 3 and 4, and also for threads 4 and 1. These kinds of critical-path locks can usually be programmed around, but they can cause severe 'thrawping'/'swashing' (combine thrashing with swapping: multiple threads are continually swapped in, immediately signal that they are stalled, and are replaced by yet another stalled thread) when Hyperthreading is added to the mix.
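Here's a toy sketch of that circular dependency (my own illustration, not from any actual renderer; every name is hypothetical): each of four threads needs its neighbour's boundary data before it can finish its own section. If every thread waited for its neighbour before publishing its own boundary, all four would sit waiting on each other in a ring, which is the circular stall described above; publishing first and waiting second breaks the cycle.

Code:
#include <cstdio>
#include <future>
#include <thread>
#include <vector>

int main() {
    const int kSections = 4;

    // One promise/future pair per section: promise = "my boundary data is ready",
    // future = "I need my neighbour's boundary data".
    std::vector<std::promise<int>> boundary(kSections);
    std::vector<std::future<int>> neighbour;
    for (auto& p : boundary) neighbour.push_back(p.get_future());

    std::vector<std::thread> workers;
    for (int i = 0; i < kSections; ++i) {
        workers.emplace_back([&, i] {
            int my_edge = i * 100;              // pretend interior + edge work
            boundary[i].set_value(my_edge);     // publish my boundary data FIRST...
            int needed = neighbour[(i + 1) % kSections].get();  // ...THEN wait on the neighbour.
            // Swapping the two lines above makes every thread wait on the next one
            // in a ring -- the four-way circular wait described in the post.
            std::printf("section %d finished using neighbour edge %d\n", i, needed);
        });
    }
    for (auto& w : workers) w.join();
}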

This is why Hyperthreading can be such a mixed bag. There are certain (but relatively few) applications that lend themselves to being able to benefit from such an arrangement while there are other applications that will actually slow down significantly if Hyperthreading is turned on. I've always questioned the wisdom of implementing Hyperthreading on chips since it adds a significant amount of complexity (and therefore cost, as well), at what most would consider insignificant performance gains.
 
I don't think compilers really matter here. Threading APIs (at least the conventional ones like Win32 threads, POSIX threads, Boost::thread) are basically wrappers around OS routines, and programmers have to manage threads themselves explicitly. The program just creates threads; it's up to the OS scheduler to decide when to run them. The exception is something like OpenMP, where the compiler can make threading decisions (not sure about this), but it's still very new, and most programs still use the conventional APIs for finer control and better support (GCC, for example, only supports OpenMP starting with GCC 4.2).
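A minimal sketch of what 'conventional' looks like here (POSIX threads, since the post mentions them; the worker function and names are my own): the program creates the threads itself and the OS scheduler decides when, and on which logical CPU, each one runs, so the compiler isn't making any threading decisions.

Code:
#include <pthread.h>
#include <cstdio>

// Plain worker function: the program decides what runs in a thread,
// the OS decides when and where it runs.
static void* worker(void* arg) {
    int id = *static_cast<int*>(arg);
    std::printf("thread %d doing its share of the work\n", id);
    return nullptr;
}

int main() {
    pthread_t threads[4];
    int ids[4] = {0, 1, 2, 3};

    for (int i = 0; i < 4; ++i)
        pthread_create(&threads[i], nullptr, worker, &ids[i]);  // explicit creation
    for (int i = 0; i < 4; ++i)
        pthread_join(threads[i], nullptr);                      // explicit management

    // By contrast, with OpenMP the compiler/runtime splits the work for you:
    //   #pragma omp parallel for
    //   for (int i = 0; i < n; ++i) do_work(i);
    return 0;
}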
 
What interested me a lot was this idea of scalable SMT that Intel is working on with Larrabee.

Where cores will break themselves down thousands of times.

1 core = 4 threads

1 thread = 8 fibers

1 fiber = multiples of 16 strands

*resource availability does not slow down the creation of new threads* (I'm under the impression that on regular GPUs it does)

Every single Larrabee core could be running 1000+ threads; with 32+ core versions, yeah, that's a lot of SMT ;)

I've always questioned the wisdom of implementing Hyperthreading on chips since it adds a significant amount of complexity (and therefore cost, as well), at what most would consider insignificant performance gains.

Probably mostly for energy efficiency gains and HPC. I agree that loads of cores is just not needed at the PC / desktop / laptop / netbook level.
 
I've seen this answered a few times, but to add to what's already been said, the reasons are very basic. Take a look at the block-level diagram of a Core 2 core. Core 2 was already an absolute monster, but one that was not being exploited well by applications, and one harboring heavy bus and memory limitations. It was already capable of very similar core performance. SMT was added to make use of the execution power that was there in the core but not being used by applications, since no single application in the desktop or 2S server segment was sustaining higher than approximately 1.9 IPC. In fact, it's usually much lower.

The second problem, as touched on above, was that branch stalls were lowering the core IPC even more, leaving a core where most of the individual blocks idle for lengthy periods. That ends up as lengthy, unnecessary execution delays. By being able to share resources and parallelize execution of a workload, you end up increasing efficiency by getting more work done concurrently. We users see that as faster performance. That's what Intel added with SMT: a shared workload. At the back end, this was only possible because for most workloads there's enough execution power sitting idle. Even the present concept can be tweaked and fine-tuned quite a bit more, specifically by increasing the multi-thread resource pools and adding a slightly bigger/faster L3. I like the design very much.
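To put a rough number on that (my own back-of-the-envelope, not from the post above): these cores can issue roughly 4 instructions per cycle, so sustaining about 1.9 IPC means only around 1.9 / 4 ≈ 48% of the issue slots are being used, and that idle half is exactly what a second hardware thread on the same core can soak up.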
 
Didn't Intel improve the cache in Core 2 and i7 to help reduce stalling with hyperthreading?
 