Multi Processing For All

As dual-core processors become affordable, many will be tempted to enter this brave new
world. I’ll spare you the typical marketing hype and try to shed some light on the
technical details that will ultimately be more useful in your purchasing decisions.

The solutions from Intel and AMD differ greatly in their design.

Anyone who reads here
knows that the AMD X2 dual-core design is superior to the Intel Pentium D dual-core
design, and the new Intel Core Architecture is superior to both. But before we just believe
all the marketing, let us try to understand why this is so.

A Quick Review

To know where you are, it sometimes helps to look at where you’ve been. For many of
us, the first, and up to now only, dual-CPU experience has been the good old dual
Celeron systems of days gone by. Whether it was a Slot 1 system with a slocket or
an Abit BP6, you could be left with mixed impressions. Especially at stock speeds,
a dual Celeron system seemed even more sluggish under Windows NT 4.0 than a single
processor system. Turning up the front side bus seemed to help a lot.

Along the way some of us have had the virtual dual-core experience of a Pentium 4 with
hyperthreading. This was either the next logical step in processor design, or perhaps an
experiment on the road to true dual-core. This design didn’t have the same problems
as the dual Celeron system. Obviously, this is a much superior processor compared
to the old Mendocino Celeron. But there is also a major design difference, which,
as we shall see, makes all the difference in the world.

Now we have dual-cores. On the Intel side of things, the early dual-core chips
are basically two regular cores glued together through the usual SMP front
side bus arrangement. With AMD, a more sophisticated design provides an on-die
link that allows the cores to talk to each other without going out over an external bus.
If you’ve done the smart thing and waited, then your reward will be a much improved
chip from Intel, the Core 2 Duo.

It’s All About the Cache

On-die full speed L2 cache is probably the greatest innovation in CPU efficiency
and performance, clock speed notwithstanding. If you don’t believe me, go into your
BIOS and disable your L2 cache and then try to boot Windows XP.

(WARNING! DO NOT ACTUALLY DO THIS! Your system will take forever to boot
and even longer to shut down, if it even makes it through the booting process at all.)

When the dual Celeron age first began, many said that the tiny L2 cache of the Celeron
and its slow front side bus would make these systems tend to perform poorly, and in
general, they were correct. Before you pass 100% blame onto the hardware, be aware
that the operating systems of the time weren’t optimized for this hardware configuration.

The problem lay in the fact that each CPU had its *own* L2 cache, and the protocol to maintain
cache coherency is not particularly efficient. When processor #1 in an SMP system
writes to memory, processor #2 must check whether it has that data in its
cache and, if it does, mark its copy as invalid. The next time processor #2
needs to access that data, it must reload it from main memory. Can we say overhead?
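To put a rough number on that overhead, here is a minimal Python sketch of the write-then-invalidate pattern described above. It is a toy model, not a real coherency protocol: the `Cache` class, the `write` and `read` helpers, and the `simulate` function are all names I made up for illustration, and a "reload" here simply means a trip to main memory.

```python
class Cache:
    """Toy private L2 cache: a set of line numbers it currently holds valid."""

    def __init__(self):
        self.valid_lines = set()
        self.reloads = 0  # how many times a line had to come from main memory

    def access(self, line):
        if line not in self.valid_lines:
            self.reloads += 1
            self.valid_lines.add(line)


def write(writer, other, line):
    """The writer updates a line; the other cache invalidates its own copy."""
    writer.access(line)
    other.valid_lines.discard(line)


def read(reader, line):
    reader.access(line)


def simulate(rounds):
    """Processor #1 writes a line, processor #2 reads it back, repeated."""
    cpu1, cpu2 = Cache(), Cache()
    for _ in range(rounds):
        write(cpu1, cpu2, line=0)
        read(cpu2, line=0)
    return cpu1.reloads, cpu2.reloads


print(simulate(100))  # (1, 100)
```

In this model processor #1 loads the line once, while processor #2 goes back to main memory on every single round. A lone processor doing the same work would have loaded the line exactly once.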

Things are further complicated by the way L2 caches are organized and aligned. Memory
reads and writes happen 32 bytes (one half cache line) or 64 bytes (one full cache line) at
a time, with cache lines typically aligned to a multiple of the line size. Even if two processors
are not accessing the same data address, they may be accessing data which is close
enough together so that it falls on the same cache line.
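A quick way to check whether two addresses land on the same cache line is integer division by the line size. The sketch below assumes 64-byte lines aligned on 64-byte boundaries; the function names and the example addresses are mine, chosen only for illustration.

```python
CACHE_LINE_BYTES = 64  # assumed line size; check your own processor's documentation


def cache_line_index(address, line_bytes=CACHE_LINE_BYTES):
    """Return which cache line an address belongs to."""
    return address // line_bytes


def share_cache_line(addr_a, addr_b, line_bytes=CACHE_LINE_BYTES):
    """True if both addresses fall on the same cache line."""
    return cache_line_index(addr_a, line_bytes) == cache_line_index(addr_b, line_bytes)


# 56 bytes apart, but on the same line
print(share_cache_line(0x1000, 0x1038))  # True

# only 8 bytes apart, but straddling a line boundary
print(share_cache_line(0x1038, 0x1040))  # False
```

Note that distance alone doesn’t decide it. What matters is where the addresses sit relative to the line boundaries.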

As you can imagine, two processors
reading/writing to the same cache line over and over again would cause system memory
utilization to skyrocket and slow the system to a crawl. I refer to this condition as
cache-thrashing. Performance suffers because it is the equivalent of
disabling the cache, flooding memory with both read and write requests and potentially
saturating the front side bus.

The Pentium 4 with hyperthreading didn’t have this issue, as it only had a single shared
cache. However, with present dual cores, each core has its own separate L2 cache. This
makes the current Intel designs, which rely on the old interface, prone to L2 cache
performance issues. The current AMD chips have an edge in this regard, which
probably contributes to their superior performance.

What’s the Schedule?

In order to understand what’s going on, you have to look at what’s running on your system.
You have an OS with many threads, most in a sleep or waiting state, and one or more
applications, single-threaded or otherwise. Calls to the operating system tend to access
operating system data and tables, and both processors are making these calls, so there
isn’t too much that can be done about the shared data in that instance.

But data that is local or belongs to an application is a different matter. If that application
always runs on the same processor, then there is a better chance that the data needed
by that application may still reside at least somewhat in that processor’s cache the next
time it gets a slice. L2 cache is most important, but let’s not forget some of the other
tables that the processor maintains that we don’t normally think about, such as the
page table cache (the TLB), the branch prediction table, and so on.

If, on the other hand, the application is allowed to bounce back and forth between processors,
the results will be disappointing. Not only is the L2 cache being thrashed, but potentially the entire task
context is thrashing as well. The solution to the problem is to have an operating system with
a smarter scheduler, and applications that can themselves request a specific processor on a
thread-by-thread basis.
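To get a feel for how much a naive scheduler can hurt, here is a small Python sketch that counts how often a task starts a time slice on a processor other than the one it last ran on. It makes a deliberately pessimistic assumption, that every such move means a completely cold cache, and the schedules themselves are invented: one pinned to processor 0, one assigned at random.

```python
import random


def count_cold_starts(schedule):
    """Count slices where the task runs on a different CPU than its previous slice."""
    cold = 0
    previous = None  # the very first slice is always cold
    for cpu in schedule:
        if cpu != previous:
            cold += 1
        previous = cpu
    return cold


SLICES = 1000

# A scheduler that always puts the task back on processor 0.
pinned = [0] * SLICES

# A scheduler that picks one of two processors at random for each slice.
rng = random.Random(42)  # fixed seed so repeated runs give the same schedule
bouncing = [rng.randrange(2) for _ in range(SLICES)]

print("pinned:  ", count_cold_starts(pinned))
print("bouncing:", count_cold_starts(bouncing))
```

The pinned task pays for a cold cache once. With a random two-way choice you would expect the bouncing task to pay roughly every other slice.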

Which brings up one last issue to tackle:

If multithreaded applications are required to
truly maximize the use of two cores, then that would imply that the threads of a particular
application are distributed amongst processors. If this is the case, and both threads are
actively reading and writing to the same area of application-owned memory, then
the potential for L2 thrashing once again appears.

The point here is that multithreading
isn’t something you can just slap onto a program for the purposes of marketing hype,
though many will. It is very important that it be well thought out in advance, and that
the extra threads be very carefully designed to avoid behaviour that would actually
degrade, and not improve, system performance.
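One common piece of that careful design is to give each thread’s private data its own cache line. The Python sketch below only illustrates the memory layout, using `ctypes` structures; it does not measure anything. The 64-byte line size, the struct names, and the assumption that the array starts on a line boundary are all mine.

```python
import ctypes

CACHE_LINE_BYTES = 64  # assumed line size


class PackedCounter(ctypes.Structure):
    """One 8-byte counter per thread, packed back to back."""
    _fields_ = [("value", ctypes.c_int64)]


class PaddedCounter(ctypes.Structure):
    """The same counter, padded out to fill a whole cache line."""
    _fields_ = [
        ("value", ctypes.c_int64),
        ("_pad", ctypes.c_char * (CACHE_LINE_BYTES - ctypes.sizeof(ctypes.c_int64))),
    ]


def lines_touched(counter_type, count, line_bytes=CACHE_LINE_BYTES):
    """Cache-line index of each counter in an array of them.

    Assumes the array itself begins on a cache-line boundary.
    """
    stride = ctypes.sizeof(counter_type)
    return [(i * stride) // line_bytes for i in range(count)]


print(lines_touched(PackedCounter, 2))  # [0, 0]
print(lines_touched(PaddedCounter, 2))  # [0, 1]
```

With the packed layout, two threads updating "their own" counters are still fighting over one line. With the padded layout each counter sits on a separate line, at the cost of 56 wasted bytes apiece.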

Thoughts for Early Adopters

In the meantime, there are some things those of you with early dual-core systems
can experiment with if performance from a particular application
isn’t quite what was expected.

Using the Windows XP Task Manager, go to the
Processes tab and right-click on a process in the list; a pop-up menu will allow you to set
the processor affinity for that process. (Note that the option to change affinity will
not show on a single processor system.) This lets you force a particular program
to run on a specific processor or core.
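A program can also set its own affinity. On Windows that is done through the Win32 calls `SetProcessAffinityMask` and `SetThreadAffinityMask`. Python’s standard library only exposes the Linux equivalent, `os.sched_setaffinity`, so the sketch below is a Linux-flavoured illustration rather than something you can run under Windows XP. The helper name `pin_current_process` is made up.

```python
import os


def pin_current_process(cpu):
    """Restrict the calling process to a single CPU and return the new CPU set.

    Returns None on platforms where os.sched_setaffinity is not available
    (for example Windows and macOS).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)  # 0 means "the calling process"
    if cpu not in allowed:
        raise ValueError(f"CPU {cpu} is not available to this process")
    os.sched_setaffinity(0, {cpu})
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    if hasattr(os, "sched_getaffinity"):
        first_cpu = min(os.sched_getaffinity(0))
        print("now restricted to:", pin_current_process(first_cpu))
    else:
        print("affinity control is not exposed by the os module on this platform")
```

As with the Task Manager route, this is something to experiment with rather than a guaranteed win. Pinning a process helps only if the scheduler was actually bouncing it around.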

Core 2 Duo

As of this writing, Core 2 Duo is obviously the best dual-core solution offered to date.

The L2 cache is shared between cores, so say goodbye to cache thrashing and
don’t worry about processor affinity. The bottlenecks that crippled earlier SMP
or early dual-core systems should be a thing of the past. Take it from someone
who has written and optimized computer software for over 20 years: a shared
L2 is the right way to go, and I’m glad someone finally did it.

This should make
the smooth performance of a dual-core system even smoother, and should also
allow for more efficient multithreaded number crunching, since now you can have
two cores essentially working on one set of the data, instead of managing two
separate sets of the working data.

