According to Tom's Hardware . . .


zachj
09-27-04, 03:29 PM
"The activation of the 64 bit mode does, in practice, have one noteworthy disadvantage: the floating-point unit is thereby rendered inoperable (even for AMD)." Is that true? Doesn't seem good or logical to me.

Can somebody explain this?

Z

Audioaficionado
09-27-04, 04:51 PM
First I've heard about it. If it's true then it's a deal breaker for me. We can still have NUMA in 32bit :D

zachj
09-27-04, 09:52 PM
Why on earth would that happen? Is it necessary? Is there a "conflict"? It makes no sense to me at all. That's a KEY part of the processor they're disabling. Sorta' makes it hard to run 32-bit and 64-bit apps at the same time, doesn't it?

Z

emboss
09-27-04, 11:53 PM
Completely wrong. x87 instructions are still available, and the FPU is not handicapped in any way. I think it's probably a language problem, as shown by sentences like: "PCI Express does not recognize this chipset" when referring to the AMD 81xx chipset.

Quote is from here btw:
http://www.tomshardware.com/cpu/20040927/opteron_vs_xeon-12.html

XWRed1
09-28-04, 12:54 AM
Umm what? The hardware fpu certainly doesn't turn off. That would be insanity, AMD would be royally screwed (in terms of competing against other processors) if that were the case.

emboss
09-28-04, 01:42 AM
On a bit more research, it appears that currently Windows XP 64-bit edition does not save the FPU state when task switching away from 64-bit mode applications. In other words, due to a braindead decision by Microsoft, 64-bit apps for Windows XP cannot use x87 or MMX instructions (must use SSE/SSE2 instead). I'll see if I can find a good page explaining this.

edit: Found a linky ...
http://blogs.msdn.com/ericflee/archive/2004/07/02/172206.aspx

zachj
09-28-04, 12:09 PM
So I guess we'll see how good AMD's SSE/SSE2 implementation really is . . . This is crazy. I gather MS will remedy the situation at some point . . . This will even hurt Intel on occasion, I imagine . . . But let me read the link. Maybe it's said something that makes what I'm saying stupid.

Z

EDIT** That's crazy to think that MS is getting rid of legacy code in such a manner. Fine, if they want to get rid of it, I couldn't care less. But disabling a key performance factor of AMD's processors seems just a little messed up. I guess that's why AMD threw SSE3 into the next revision of Athlon64 so quickly . . . What are the performance ramifications of having done this? Will AMD's dominance in science-type applications change? I'm really not up on what some of the more complex CPU structures "do."

Dylruss
09-28-04, 01:21 PM
Sounds like lazy programming on the part of MS.

rezon8
09-28-04, 01:47 PM
Quote: "Sounds like lazy programming on the part of MS."

Exactly. Why would x87 in 64-bit mode hurt? It doesn't matter that sse is a faster and newer FP instruction set, some clients will have apps that run x87 code, and the whole premise of x86-64 is the backwards compatibility with the previous technologies. I mean, if you can run 32bit and 64bit apps at the same time in WinXP64, why not x87 and SSE?

Also i am not a coder by any means, so i might be completely missing something as well. Is SSE really so much better than mmx/3dnow/x87?

XWRed1
09-28-04, 02:08 PM
Technically it shouldn't matter if they can't run x87, the compiler should take care of it - unless they are using hand-written asm.

Since they are going to be recompiling for amd64 already, might as well use the opportunity to get them off x87.

zachj
09-28-04, 06:27 PM
Well, again, I must say that it hurts an area of performance where AMD has excelled for a long time. Assuming their SSE/SSE2/soon-to-be SSE3 implementation is very good, I guess I don't care, but if it offers a way to squeeze out performance in applications being optimized for AMD processors, it's silly to cut it out. People optimize for P4s. Now it makes it more difficult to "optimize" for AMDs.

Z

XWRed1
09-28-04, 09:56 PM
SSE is faster than x87 anyways, and apparently much more sane to code for. Anyone with custom asm is going to have to change their code anyways.

emboss
09-29-04, 12:18 AM
Actually, that only applies to the P4. For single-precision where strict IEEE compliance is not required, you can do many things faster in 3DNow! (which is also lost when the FPU registers are dumped) compared to SSE.

The bigger loss is the precision in intermediate calculations. When you use the FPU to calculate a long equation, a lot of it is done on the FPU stack with 64 bits of precision, with the answer rounded to 53 bits (double) or 24 bits (single) when it's stored back into memory. However, with SSE/SSE2 you only get answers at any step to single or double precision respectively. So, depending on the function being evaluated, you can easily lose a few bits of precision along the way. Some things can't be calculated even somewhat accurately using double precision. For example, if you want to calculate
1/(1-x1) - 1/(1-x2)
where both x1 and x2 are on the order of 10^-12 or so. Doing the above with x1 = 1.234*10^-12 and x2 = 1.432*10^-12 gives
-1.97841742988203E-13 (using doubles)
-1.98000036505386E-13 (using 80-bit internals for intermediate values)
The exact answer being
-1375000000000000/6944444444425930555555567827 or
-1.980000000005278680000010574E-13
(with a few zillion digits to the right). So the double-precision answer has an error of 0.08%, and the extended precision one an error of 0.000018%. Sure, in this particular case there are ways to improve the accuracy or use a recursive method, but it's still a serious problem that's bitten me in the ass more times than I'd like to remember (and made me give up on using MSVC due to its inability to handle anything bigger than double precision). This problem (difference of 1/(1-x) values) is not particularly unusual, and comes up particularly in hyperbolic geometry, and to a lesser extent in physics.
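
A minimal sketch of the cancellation problem above, in Python rather than anything from the original post. It evaluates the same expression once in ordinary doubles and once exactly with rational arithmetic; the names `f`, `double_result`, `exact` and `rel_error` are made up for illustration. Python floats are IEEE doubles on common platforms, so this only reproduces the double-precision side of the comparison, not the 80-bit one.

```python
from fractions import Fraction

def f(x1, x2):
    # Same expression as in the post; works for floats and Fractions alike.
    return 1 / (1 - x1) - 1 / (1 - x2)

# Evaluated in IEEE double precision (every intermediate rounded to 53 bits).
double_result = f(1.234e-12, 1.432e-12)

# Evaluated exactly, using rationals for the same decimal inputs.
exact = f(Fraction(1234, 10**15), Fraction(1432, 10**15))

# Relative error of the double-precision answer against the exact one.
rel_error = abs((Fraction(double_result) - exact) / exact)

print("double :", double_result)
print("exact  :", float(exact))
print("rel err: {:.4%}".format(float(rel_error)))
```

Both reciprocals are very close to 1, so their rounding errors (around 10^-16 each) are large compared with the difference of about 2*10^-13 that survives the subtraction.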

A final thing is the loss of the transcendental functions sin, cos, etc. There are native x87 instructions for these that provide 64 bits of precision. You can get about 62-63 bits of precision (depending on the input and algorithm used) by doing a Taylor expansion and using extendeds, but you're limited to 51-52 bits of precision should you try to do the same thing using doubles. Sure, it's not a big problem for games and the like (which typically only deal with single-precision numbers anyhow) but the whole issue of intermediate precision crops up again.

Moving on to the question of why MS did this ... well, the only reason I can think of is that they ran out of space in the context structures for threads (due to the bigger and more-of-em general purpose and XMM registers) and had to ditch something. The FPU registers were the ones that took the fall. Why do I think that? Well, unless MS is doing a somewhat funky task switch, the instruction that stores (and the corresponding one that loads) the XMM registers and state also stores the FPU state. You have to make a conscious effort to NOT store the FPU state. Given that they do store the state for 32-bit mode apps, the code is obviously in there and working, so IMO that leaves two options:
1) MS wanted to discourage people from using x87/MMX/3DNow
2) MS for some reason could not fit the FPU registers into the thread context
Option 1) seems stupid. Given they have the code to do it, and have nothing to gain by forcing people not to use x87 {*}, I don't see why they would go through the effort to stop the state being stored. I can see why they dropped support for inline assembly from MSVC (simplifies the optimiser significantly), but we're talking about code that's already there and being used, just not being used in some circumstances. There's nothing hardware-wise stopping them (Linux saves the FPU state for 64-bit mode apps), so it's definitely a "problem" at the MS end. In my mind, that means that for whatever reason, they just don't have the space to store it.

{*} AMD fanboys: insert conspiracy theory about the P4's weak FPU and Intel paying MS $$$ here :)

XWRed1
09-29-04, 04:33 AM
I thought sse or sse2 took 128-bit words. Do either of those lack the trig functions? I think I read they added a cross product in sse2.

If sse2 does 128-bit words, then surely even a taylor series for a trig function should turn out ok, if a little messy; but then the compiler should take care of that behind the scenes.

emboss
09-29-04, 06:52 AM
Both SSE and SSE2 operate on (the same) 128-bit *registers*. SSE does single-precision numbers only (and some limited logical and shuffling operations), and SSE2 does double-precision and integer numbers (though brings nothing new to the party in terms of mathematical functionality). Neither of these (nor SSE3) has transcendental support, and they definitely didn't add a cross product to SSE2. What they did add was support for doing faster dot-products (strictly, they added horizontal sums), and to a limited extent faster cross products, in SSE3 (about time too, since 3DNow! has had it since the K6 ...).

Although SSE(2) operate on 128-bit registers, they can still only deal with single and double precision floating point numbers, and up to 64-bit integers (though either 1 or several of them at a time). So each term in the Taylor series is limited in accuracy, and hence the result becomes inaccurate. Sure, until you get close to pi/2 the error is only in the LSB of the result (though once you get above about 1.35 you start having errors greater than just the LSB, but this can be fixed by taking a different Taylor series centre) but IMO there should be NO variation in values of cos between platforms, given the same precision and same input number and assuming they all comply with the IEEE standard. There's undoubtedly a way to reduce the error (especially by tweaking the precomputed factorial-reciprocals), but I would be very surprised (and impressed) if it was possible to calculate cos(x) exactly for all 0 <= x <= pi/2 to 53 bits of precision using 53 bits of precision in the intermediate operations. The older x87 coprocessors (IIRC) actually had 82 bits of internal precision for exactly this reason. I'm not sure of how they do things nowadays though.
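
To make the Taylor series point concrete, here is a rough Python sketch (not the routine any real runtime library uses). It sums the same truncated series for cos about 0 twice: once in doubles, where every step is rounded to 53 bits, and once in exact rational arithmetic. The function name `cos_taylor`, the 14-term cutoff and the test point 1.5 are all arbitrary choices for illustration.

```python
import math
from fractions import Fraction

def cos_taylor(x, terms=14):
    """Sum the first `terms` terms of the Taylor series of cos about 0.

    Works in whatever number type x has: floats round at every step,
    Fractions carry every intermediate result exactly.
    """
    total = x * 0 + 1          # 1 in the same number type as x
    term = total
    for k in range(1, terms):
        # Next term: multiply the previous one by -x^2 / ((2k-1)(2k)).
        term = -term * x * x / ((2 * k - 1) * (2 * k))
        total += term
    return total

x = 1.5                                  # exactly representable in binary
in_double = cos_taylor(x)                # intermediates rounded to 53 bits
exact_series = cos_taylor(Fraction(3, 2))  # same series, no rounding

# Error introduced purely by rounding the intermediate steps.
rounding_error = float(abs(Fraction(in_double) - exact_series))

print("double series :", in_double)
print("exact series  :", float(exact_series))
print("library cos   :", math.cos(x))
print("rounding error:", rounding_error)
```

The exact series, rounded once at the end, plays the role of "extra intermediate precision"; any difference between it and the all-double version comes from rounding along the way.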

Obviously single-precision transcendentals are no problem, as they can be evaluated at double precision then rounded.

And yes, the idea is that the compiler magically hides all this from view (and the transcendental functions are done through RTL functions). This is all fine and dandy, but it requires formally-verified IEEE-compliant emulation routines. Being someone who has tried to write only an IEEE library (not worrying about formal verification), I can tell you that even writing an IEEE-compliant emulation routine is nasty. Having seen some of the formal verification proofs, there's no way I'd want to start such a thing :)

zachj
09-29-04, 09:16 PM
Jebus . . . I don't understand a THING you two just said :-( I took a VB class and a C++ class in high school, but it was basically a "show up and shut up" kinda' class, where as long as you didn't look like you were doing nothing, the teacher didn't teach and we got an A . . . But I can't program. I wrote an AppleScript the other week. I was so proud. Now I'm embarrassed.

Z

dark_15
10-03-04, 11:01 PM
ZachJ - I know how you feel... I was like wow after some of those posts.

Where did you all learn that stuff? I'd like to know too...

ookabooka
10-04-04, 08:38 AM
ive done some asm programming, but only for microcontrollers. I have a vague comprehension of what he is saying. Largest registers ive dealt with were 8 bit. . . kinda wimpy compared to the 128bit for SSE. The trig stuff makes sense to me as ive done a lot of work in physics engine design for computer games. Very important to have fast trig functions that are uniformly accurate. For anything calculation intensive i use linux, and yes, i have taken advantage of my lil amd64's 64-bit math. Increases the speed of a lot of my calculations by a crapload. . .so as long as it works in windows good enough for computer games i dont really mind. Could you explain what is going on on a slightly lower scale? Like, why 3dnow would get dumped. I dont know about how often windows uses it, but i use it wherever possible in linux (Gentoo) and find it nice to know that im using every single one of my processor's abilities.
So what happens to the processor and code in windows? 3dnow commands will cause a segfault? and non-sse math commands will do the same? How does mmx fit into all this?

emboss
10-04-04, 07:26 PM
OK, here goes :)

(note to purists: ya, some of the stuff below is not entirely correct. If you want to try to explain IEEE denorms, limitations of EA fields in addressing memory and the like then feel free :) ).

Back in the good ol' days of the non-MMX Pentiums, there were effectively two parts to the CPU: the integer part and the floating-point part. The "state" of the integer part consisted of the values stored in each of the eight 32-bit registers, and some other things like control flags and the like. The state of the floating point part consisted of eight 80-bit registers and associated control flags. So to switch between tasks, the OS would stop executing the current task, save the integer and floating point states to somewhere in memory, load up the integer and floating point states for the next task to execute, then let that task run for a while. Of course, it's a bit more complex than this, but that's the basic idea anyhow.

At this point in time, there were 6 main data types that the CPU could deal with:
[] 8, 16, and 32-bit integers
[] single (32-bit with 24 bits of precision), double (64-bit with 53 bits of precision), and extended (80-bit with 64 bits of precision) floating-point numbers.

Floating-point numbers are stored in the binary equivalent of scientific notation, like 1.23*10^2 for 123. Or in binary, 1.01011*(10)^100 for 10101.1 (21.5 for you decimal folk :) ). The 1 at the front is not stored, as it always must be there (otherwise you have 0.1yz, which can be stored as 1.yz by decreasing the exponent by one).

So, for a single precision number, there is one bit for the sign, 8 bits for the exponent, which leaves 23 bits for the mantissa (the bits following the binary point). Counting the implied leading 1, that's 24 bits of precision, so such numbers are accurate to about 7 significant decimal digits. A double precision number, with its 52 stored mantissa bits (53 bits of precision), is accurate to about 15 to 16 significant decimal digits. But it's far easier and more accurate to talk in binary, as there's all sorts of problems like being unable to specify certain numbers like 0.3 exactly in a floating point binary number.
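
The single-precision layout just described can be poked at from Python using the standard `struct` module. This is only an illustrative sketch; `decode_single` is a made-up helper name, and it ignores special cases like zero, denormals, infinities and NaNs.

```python
import struct

def decode_single(value):
    """Split a number, stored as an IEEE single, into its three fields."""
    # Pack as a 32-bit float, then reread the same 4 bytes as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                        # 1 bit
    exponent = ((bits >> 23) & 0xFF) - 127   # 8 bits, stored with a bias of 127
    fraction = bits & 0x7FFFFF               # 23 stored bits (leading 1 implied)
    return sign, exponent, fraction

sign, exponent, fraction = decode_single(21.5)
print("sign     :", sign)
print("exponent :", exponent)
print("fraction :", format(fraction, "023b"))

# 0.3 cannot be stored exactly: a round trip through single precision
# gives back a slightly different value.
(as_single,) = struct.unpack(">f", struct.pack(">f", 0.3))
print("0.3 as a single:", repr(as_single))
```

For 21.5 the fraction field starts with 01011, the digits after the binary point in 1.01011, and the exponent is 4 (100 in binary), matching the example in the post.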

To actually do a floating-point operation, the data would be loaded out of memory and into the FPU registers. The FPU could only deal with 80-bit numbers, so there was an implicit conversion from single/double to extended when the data was loaded into the FPU. Then, calculations would be done (using 80-bit floats), and when the calculation was done it would be saved from the FPU registers back into memory (with a conversion to single/double if so requested).

To do an integer operation, things are a little more confusing. Although the x86 architecture only has 8 registers, it has eight 32-bit registers, eight 16-bit registers, and eight 8-bit registers. This is done by reusing parts of the 32-bit registers to hold the 16-bit and 8-bit registers (or alternatively, only operating on part of the 32-bit register).

The x86 registers can be divided into two groups:
[] 4 general-purpose registers: eax, ebx, ecx, edx are the 32-bit general-purpose registers. Take eax for example. There is a 16-bit register called ax which is mapped to the bottom 16 bits of eax. So if eax contains the hexadecimal value 0x12345678 then ax contains the value 0x5678. Also, there's two 8-bit registers, one mapped to the high byte of ax (ah) and one to the low byte (al). So with the value of eax above, ah would contain 0x56 and al would contain 0x78. The other three registers ebx, ecx, edx have the same structure.
[] A stack pointer (esp), a "base" pointer (ebp), and two "index" registers (esi and edi): these are 32-bit registers, and are almost the same as the above except they don't have any 8-bit parts mapped in. So there is a si register but no sl register.
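
The way ax, ah and al overlay eax can be mimicked with plain masks and shifts. This Python sketch only models the bit layout, nothing about real instruction behaviour; `write_al` is a hypothetical helper showing that writing the 8-bit register changes only the low byte of the 32-bit one.

```python
eax = 0x12345678

ax = eax & 0xFFFF          # bottom 16 bits of eax
ah = (ax >> 8) & 0xFF      # high byte of ax
al = ax & 0xFF             # low byte of ax

def write_al(eax_value, new_al):
    """Return eax after storing new_al into its lowest byte."""
    return (eax_value & 0xFFFFFF00) | (new_al & 0xFF)

print("ax =", hex(ax))
print("ah =", hex(ah))
print("al =", hex(al))
print("eax after al = 0xFF:", hex(write_al(eax, 0xFF)))
```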

To do an integer operation of a particular size (8, 16, 32 bits) you load up the (8, 16, 32 bit) registers with the value and do the operation. You can only operate on one pair of registers at once, and you cannot operate on part of a register (eg: you cannot do an 8-bit add of the lower bytes of si and di).

The introduction of MMX changed this a bit. To maintain compatibility with existing task switchers, the eight 64-bit MMX registers were mapped over top of the FPU registers. This is so that if you had two mmx-using programs running at the same time, they would not interfere with each other even on OS's that didn't know about MMX.

The actual operations that MMX introduced were somewhat different than the existing model. The 64-bit registers could be operated on as either eight 8-bit registers, four 16-bit registers, or two 32-bit registers. In the existing model, adding 0x80808080 to 0xC0C0C0C0 (in 32-bit registers) would result in the value 0x41414140. If the register was treated as a collection of four 8-bit registers, then the result would be 0x40404040. This mode of operation (splitting the register up) is called packed-integer maths, as you're packing multiple smaller values into one bigger register.
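
The difference between the two kinds of add can be simulated in a few lines of Python. This is a sketch of the arithmetic only (wrap-around adds on a 32-bit value, as in the example), with made-up function names `add32` and `packed_add8`; it is not how MMX is actually programmed.

```python
def add32(a, b):
    """Ordinary 32-bit add: carries ripple across byte boundaries."""
    return (a + b) & 0xFFFFFFFF

def packed_add8(a, b, lanes=4):
    """Packed add: each 8-bit lane is added separately and wraps on its own."""
    result = 0
    for lane in range(lanes):
        shift = 8 * lane
        lane_sum = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)
        result |= (lane_sum & 0xFF) << shift   # drop the carry out of the lane
    return result

print(hex(add32(0x80808080, 0xC0C0C0C0)))
print(hex(packed_add8(0x80808080, 0xC0C0C0C0)))
```

In the ordinary add, the carry out of each byte (0x80 + 0xC0 = 0x140) leaks into the next byte up; in the packed add it is simply thrown away.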

Shortly after, AMD introduced 3DNow!. This used the MMX registers to do maths on two 32-bit floating-point numbers. In terms of moderately-accurate performance (eg: 3D games) it gave a significant performance boost, especially with the K6-2 and later. Alas, this was back when no one was optimising for AMD CPUs, so the programs that used it were few and far between. Its lack of IEEE compatibility (it sacrificed things like correct handling of infinities for speed) didn't help things greatly either.

In reply, Intel brought out SSE with the P3. This was not backwards compatible with earlier OS's, as it introduced yet another set of registers (eight 128-bit registers) that needed to be stored on a task switch. If the registers were not stored, then two programs that used SSE would have their SSE registers overwritten by the other application after a task switch.

These registers were 128 bits wide, and supported both operation modes: they could do a packed operation on all four (single-precision) numbers, or a "normal" operation on the lowest (32-bit) number only. The latter mode was slightly faster to execute, but of course you only got one operation done instead of four. However, you could not operate on the whole register as a single entity (except for loading data into the register), so you couldn't treat it as a single hugely-accurate floating-point number. Also, double-precision operations could not be done.

SSE was added to the AthlonXP, but in many cases 3DNow performance was better than SSE, especially in matrix operations. This is primarily due to decoder issues, but also the flops per cycle of SSE to 3DNow is not that much different. I can't remember the exact timings, but a 3DNow dot product was in the order of 30% to 40% faster than the SSE equivalent. OS's soon gained the ability to save the SSE registers during a task switch.

The other thing of note is SSE2, which introduced double-precision and integer operations to SSE registers. You still couldn't treat an SSE register as anything more than an optionally packed collection of integers (up to 64 bits each) or single/double-precision floats. So you couldn't use an SSE register to do 128-bit integer adds, or add 80-bit floats.

With the Athlon64, things have changed yet again. In 64-bit mode, there's now 16 SSE registers, and sixteen (now 64-bit) general-purpose registers. Obviously the task switcher has to be modified to store all these correctly. What MS has done is say that they're not going to be storing the FPU registers any more. This means that instructions that use the FPU registers (pre-SSE floating point operations, MMX, 3DNow) will not have their state saved over a task switch, so if multiple applications use such instructions they will interfere with each other. However, it's possible for the OS to detect when the application uses the FPU, so XP-64 will probably nuke the program when this happens.
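
Here is a toy Python model of the scenario described above, where a task switcher saves the general-purpose registers but may or may not save the x87/MMX/3DNow! register file. Everything in it (`ToyCPU`, `switch`, `run`, the register counts, the task names) is invented for illustration and says nothing about how any real OS lays out its thread contexts.

```python
class ToyCPU:
    def __init__(self):
        self.gpr = [0] * 16     # general-purpose registers
        self.x87 = [0.0] * 8    # register file shared by x87, MMX and 3DNow!

def switch(cpu, saved, old, new, save_x87):
    """Save the outgoing task's registers and restore the incoming task's."""
    saved[old] = {"gpr": list(cpu.gpr)}
    if save_x87:
        saved[old]["x87"] = list(cpu.x87)
    state = saved.get(new)
    if state is not None:
        cpu.gpr = list(state["gpr"])
        if save_x87:
            cpu.x87 = list(state["x87"])

def run(save_x87):
    """Task A stores a value, task B overwrites it, then A resumes."""
    cpu, saved = ToyCPU(), {}
    cpu.x87[0] = 3.14                       # task A is mid-calculation
    switch(cpu, saved, "A", "B", save_x87)
    cpu.x87[0] = 2.71                       # task B uses the same register
    switch(cpu, saved, "B", "A", save_x87)
    return cpu.x87[0]                       # what task A sees when it resumes

print("x87 state saved    :", run(save_x87=True))
print("x87 state not saved:", run(save_x87=False))
```

When the register file is not part of the saved context, task A comes back to find task B's value sitting in its register, which is the interference the post is talking about.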

Finally, moving on to the whole precision issue. When doing an addition, the smaller number is adjusted so that the exponent is the same as the big number, and any trailing bits of the number are lost. Usually one bit more than the length of the number is kept. For example, adding (in binary) 1.001E00 + 1.111E-11 first changes it to 1.0010E00 + 0.0011E00 then does the addition, getting 1.0101E00 which then gets rounded to either 1.011E00 or 1.010E00 depending on how the rounding flags are set.

A small example in binary might make the issue easier to see:

Add up the numbers 1.1, 10.1, 11.1, 100.1, ..., 111.1 keeping only three bits of precision (after the decimal place).
1.1 + 10.1 = 0.1100E01 + 1.0100E01 = 10.0000E01 = 1.000E10
This is correct.
1.000E10 + 11.1 = 1.000E10 + 0.1110E10 = 1.111E10
This is correct.
1.111E10 + 100.1 = 1.1110E10 + 1.0010E10 = 11.000E10 = 1.100E11
This is correct.
1.100E11 + 101.1 = 1.1000E11 + 0.1011E11 = 10.0011E11 = 1.00011E100, which rounds to 1.001E100
This is correct only after rounding (rounding up, as is common): the exact sum 10001.1 no longer fits in the available bits.
Continue it yourself, and you should see that from here on you "lose" the least significant bit of the input from the second operand. I'll complete it later when I have the time :) This is the danger of only storing intermediate results at the precision at which your input numbers are. Once your intermediate value gets to be 4 times bigger than the input values, you'll start losing precision. With the "traditional" approach, with the 64 bits of precision in the intermediate results, you don't have to worry so much about losing bits off the bottom.
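
The running sum can be carried through to the end with a small Python simulation. This is a sketch under stated assumptions: `round_sig` is a made-up helper that rounds to 4 significant bits (one before the binary point plus three after, as in the example) using round-half-up, and each addition is done exactly before rounding, which is a simplification of what hardware does.

```python
import math

def round_sig(x, bits=4):
    """Round a positive number to `bits` significant binary digits (half up)."""
    if x == 0:
        return 0.0
    mantissa, exponent = math.frexp(x)       # x = mantissa * 2**exponent
    scaled = mantissa * 2**bits              # bring `bits` digits left of the point
    rounded = math.floor(scaled + 0.5)       # round half up
    return math.ldexp(rounded, exponent - bits)

values = [k + 0.5 for k in range(1, 8)]      # 1.1, 10.1, ..., 111.1 in binary

total = values[0]
steps = []
for v in values[1:]:
    total = round_sig(total + v)             # keep only 4 significant bits
    steps.append(total)

print("running totals:", steps)
print("low precision :", total)
print("exact sum     :", sum(values))
```

The first three additions come out exact; after that the running total is big enough that the trailing half of each input no longer fits, and the final answer drifts away from the exact sum.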

How does this apply to trig functions? Well, currently they're calculated in 80-bit extended precision, then rounded to the desired size. The way to calculate them without a hardware trig function (ie: using SSE/SSE2) is to do a repeated add/multiply thing, which, like above, loses precision (though only in the least significant bit or two). However, the value obtained using the SSE method will not, in general, agree with the value obtained using the native tan function. So you'll get different results for trig functions between 32- and 64-bit mode code (or even between running the identical code on Linux and XP-64), even when it's compiled from identical source code. This is not a Good Thing.

OK, that's probably enough for now. I'll post a bit more later if anyone's still interested :)

zachj
10-04-04, 08:42 PM
Does one need to understand this to program? Only in assembly? If one does, that would certainly explain why I can't program. I guess I "understand" what's going on, but I can't talk binary, and I just stop paying attention when I read it :) It's a good explanation, though.

And I thought I knew a lot about processors. BAH! I don't understand the actual computations being done in hardware, and I don't think I ever will.

Z

ookabooka
10-05-04, 02:50 AM
awesome, u prolly just shaved off 10 credit hours off my computer engineering degree :-p I was quite upset not to see 3dnow used everywhere, part of the reason why i like gentoo, because it can try to use it whenever it can. Should mmx and 3dnow be deprecated? Wasnt there a 3dnow2 or 3dnow_ext?

man_utd
10-05-04, 08:09 PM
3dnow+ anyway

aka1nas
10-07-04, 06:42 PM
I thought AMD's implementation of SSE decoded it into FPU instructions on chip to leverage the Athlon's comparatively strong FPU? I thought this was why the athlonXP has relatively weak performance gains with SSE1 optimized apps relative to the P4? Is this true and is it still the case on the K8? What does this mean with regard to running 32-bit programs on XP for x86-64 when it comes out?

Polariz^
10-11-04, 11:29 AM
My head hurts...

Big_Baller
10-14-04, 02:21 PM
Emboss is nerd jesus. (bows) That was awesome.