Look at Core i7 965 Extreme. There is a graph for it with HT on and HT off.
Ah, you are correct. I missed that one.
A lot depends on how well a program is written, as well as on how good the compiler is at producing executable code that runs well across multiple threads and/or multiple cores.
A benchmarking program like Cinebench isn't necessarily the best way to estimate real-world performance: we already know that certain compilers are written to maximize benchmark scores, and that certain benchmark programs are written to favor one processor (or one feature--like Hyperthreading) over another.
That being said--a rendering application (like Cinebench) can easily distribute processing among multiple threads by dividing an image (or multiple frames, for video) into sections and letting each thread work on its own section. Bottlenecks occur at the boundaries of these sections, since data from both sides of a boundary needs to be available at the same time so that interpolations across it can be calculated. A thread can stall while waiting for the thread working the adjacent section to finish its boundary calculations.
Consider an image being rendered with two threads on a dual core processor. The image is divided in two and each thread gets to run on its own core (omit O/S overhead for this example). A 'smart' rendering application will divide the image along the shorter dimension (a landscape image should be divided into left and right halves, since its height is shorter than its width) to minimize the boundary. Unless the image is the same on both sides, one section will likely be completed before the other. [Whether the application performs boundary calculations pixel by pixel, or waits to do all the boundary calculations after the rest of the section is completed--while not irrelevant--still leaves one of the threads periodically waiting on the other.] These sorts of issues help explain why (on top of O/S overhead) adding cores to a CPU won't scale performance by the full multiple of the number of cores: a dual core won't perform twice as fast as a single core, and a quad core won't perform four times as fast as a single core or twice as fast as a dual core--it will be something less.
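To make the stall concrete, here is a minimal sketch (not a real renderer--the image size, the per-pixel "work", and names like render_half are all illustrative): two threads each fill half of a small image, and the seam interpolation can't start until the slower half finishes.

```python
import threading

# A "landscape" image split into left and right halves, one thread per half.
WIDTH, HEIGHT = 8, 4
image = [[None] * WIDTH for _ in range(HEIGHT)]

def render_half(x_start, x_stop):
    # Each thread fills only its own columns; the arithmetic here is just a
    # stand-in for a real shading/rendering calculation.
    for y in range(HEIGHT):
        for x in range(x_start, x_stop):
            image[y][x] = (x + y) % 7

mid = WIDTH // 2
left = threading.Thread(target=render_half, args=(0, mid))
right = threading.Thread(target=render_half, args=(mid, WIDTH))
left.start()
right.start()

# The boundary pass needs data from BOTH sides of the seam, so it cannot
# start until both threads are done -- this is exactly the stall described
# above: if one half is more complex, the faster thread sits idle here.
left.join()
right.join()

# Interpolate across the seam only after both halves exist.
seam = [(image[y][mid - 1] + image[y][mid]) / 2 for y in range(HEIGHT)]
print(len(seam))  # one blended value per row along the boundary
```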
Now, let's add Hyperthreading to the example. What Hyperthreading does is 'fake' the O/S and application into thinking there are more cores than there actually are. For the sake of this example we'll assume that four cores are being simulated on two actual cores. The application will divide the image into four sections--one for each of the four threads it will spawn. The image will likely be divided by one horizontal line and one vertical line. [The rendering program might actually divide the image into four horizontal strips instead of four quadrants. It might even divide the image into irregular areas, since certain parts might be more complex than others. The point, however, is that the total length of the boundaries has, at the very least, doubled.]
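The "at least doubled" claim is easy to check with some back-of-the-envelope arithmetic. This little helper (illustrative only--the function name and the 1920x1080 example image are my own) totals the seam length for a grid split of a W x H image:

```python
# (cols - 1) vertical seams each of length `height`, plus
# (rows - 1) horizontal seams each of length `width`.
def seam_length(width, height, cols, rows):
    return (cols - 1) * height + (rows - 1) * width

W, H = 1920, 1080
print(seam_length(W, H, 2, 1))  # two halves: 1080 pixels of boundary
print(seam_length(W, H, 2, 2))  # four quadrants: 1080 + 1920 = 3000
print(seam_length(W, H, 4, 1))  # four vertical strips: 3 * 1080 = 3240
```

Going from two halves to four quadrants takes this example image from 1080 to 3000 boundary pixels--nearly triple--and four strips is worse still, which is consistent with the point above.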
So, what then, does Hyperthreading buy you? Well, if one of the threads is stalled at a boundary (and assuming the application is written to take advantage of this), that thread can signal that it is waiting, and another thread that is not currently executing can be swapped in so that it might continue.
What are the drawbacks? Well, first of all, the number and length of the boundaries have at least doubled. So the number of boundary calculations has increased--even though the actual number of cores has not. More boundaries also increase the chance of a thread stalling, and each time a thread stalls there is a cost in terms of signaling and swapping one thread out for another. In fact, there will be regular thread swapping even when no thread is stalled (this is what 'time sharing' is, after all). Furthermore, the thread being swapped in to replace a stalled thread may itself already be stalled, because it is waiting for data from a fourth thread that hasn't completed its calculations--having been swapped out for an inordinate amount of time. Consider the situation where thread 1 is stalled at a boundary because it is waiting on thread 2 to work its way to the boundary of sections 1 and 2. But thread 2 is stalled at the boundary between sections 2 and 3 because thread 3 hasn't worked its way to that boundary. Same for threads 3 and 4, and also for threads 4 and 1. These kinds of circular waits along the critical path can usually be programmed around, but can cause severe 'thrawping'/('swashing') (combine thrashing with swapping--multiple threads continually being swapped in, immediately signaling that they are stalled, only to be replaced by another stalled thread) when Hyperthreading is added to the mix.
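One common way the circular wait can be programmed around is to have every thread finish its own interior first and then meet at a barrier before anyone touches a seam--so no thread ever blocks on a neighbor that is itself blocked. A minimal sketch (the section numbering and the "work" are illustrative, not how any particular renderer does it):

```python
import threading

N = 4  # four sections arranged in a ring: seam between section i and i+1
interior_done = [False] * N
barrier = threading.Barrier(N)
results = []
results_lock = threading.Lock()

def worker(i):
    interior_done[i] = True  # stand-in for rendering section i's interior
    barrier.wait()           # nobody starts seam work until every interior exists
    neighbour = (i + 1) % N
    # Safe: the barrier guarantees the neighbour's interior data is ready,
    # so this can never turn into the thread-1-waits-on-2-waits-on-3 cycle.
    with results_lock:
        results.append((i, neighbour, interior_done[neighbour]))

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))
```

The trade-off, of course, is that the barrier makes everyone wait for the slowest interior--which is just the original "one thread finishes first" cost paid in one lump instead of as scattered stalls.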
This is why Hyperthreading can be such a mixed bag. Relatively few applications lend themselves to benefiting from such an arrangement, while other applications will actually slow down significantly with Hyperthreading turned on. I've always questioned the wisdom of implementing Hyperthreading on chips, since it adds a significant amount of complexity (and therefore cost) for what most would consider insignificant performance gains.