An extended tale on debugging a PC problem. – Dale
Well, I wasn’t sure how to title this story/report/‘fable’! I know, computers don’t get sad or happy, they just run programs! But sometimes, it makes you wonder!
I’ve had my share of problems with the half-dozen systems I built and maintained over the years, but this one beat them all. Also, I’ll be the first to admit that overclocking a system isn’t the nicest thing to do to a fine working computer setup. But what fun is there in mediocrity!
And besides, I had all these notes scattered around that I wanted to save. So, to save them for myself and to share the frustration and end result with all of you, here is the dreaded tale.
For months, I had been having random lockups and other issues. They didn’t happen in any particular program or operation, so I couldn’t pinpoint the source. The problems included spontaneous restarts and complete lockups that required the reset button. There was also what I call a “Stall”, in which the entire system refuses to respond even though everything seems to be running. On top of that, there were internet browser lockups and protection errors that shut the program down, and modem failures that could be corrected only by restarting the system.
Since I hadn’t done a clean install of the OS since changing mainboards, I suspected this was at least part of the cause. But since a ground-up install is a time-consuming process, I put it off till later. Bad idea!! Again, I could not point to any particular program or piece of hardware as the culprit, since nothing failed completely or locked up consistently.
There were, of course, some hardware changes, such as a new CD-ROM drive and an external modem. And lots of software that gets loaded and used or, more often, loaded and discarded. There was also a BIOS update and a beta BIOS. While I am sure these changed a lot, none seemed to cause any additional problems.
I installed, and still use, ZoneAlarm, IE6, and a new antivirus program. The only addition I could really suspect was a QuickVideo Weecam and its associated software. It got power from the keyboard connection and used the serial port for data. It loaded and worked fine for many days before the more serious problems began.
One morning it wouldn’t start!
It worked the night before and it was shut down properly! No storms or power problems! What I got was a blank screen. Fans running but no POST. First thing I did was begin pulling components. That’s what you are supposed to do first, right? I had it all torn apart in about 30 minutes with no solution. Also, during this time I set the thing to stock speed. For reference, this is what was in the system:
- Enlight case
- Enermax 431 watt PSU
- Asus A7V133 ver 1.5
- Athlon 1 gig-200 FSB
- Taisol copper based HSF
- WD hard drive
- Plextor CD-RW
- CD-ROM
- ATI 64 meg AGP card
- external Diamond modem
- Soundblaster card
- Panasonic E15 monitor
After spending several hours checking the hardware, I had removed and reinstalled nearly every component several times. I tried a different CPU and several BIOS changes with no luck. Went back to jumper settings with both CPUs – no change. I switched the cables for the HDD and CD-ROM drives to about every combination possible with no resolution to the problem.
So, I then cleared the CMOS and loaded the default, no good. Flashed back to the original BIOS, nope. Flashed to the newest BIOS from Asus, no joy. There were other things that I tried but I can’t remember them all.
Eventually, as the system was sitting there with no video output but seemed to be running, I pressed the power button by mistake. I intended to press the reset button. Suddenly the video came to life and it began to POST. What the heck???
After that, it would boot sometimes, but would lock up or shut down on its own. A lot of weird things happened during this time. I finally decided it was a bad video card. I had a Visiontek AGP card with some problems that I use for a backup, so I put that one in and it booted. I also pulled a PCI video card from another system and it booted with that one as well. Well, it had to be the video card, so I ordered a new one, a Diamond Stealth III.
Got it a week later, installed it and it worked fine. So, off to ATI goes the other card. About 2 weeks later I checked on the RMA and was told a replacement was sent. Fine, I’ll put it in and get back to where I was. Not so fast! A note from ATI said the card I sent checked out fine but they sent a “service unit” in its place anyway. Now what?
It was several days before it finally got to me. I just had to see if the ‘service unit’ card would work. Well, it didn’t! I tried everything. Uninstalling the old video software, setting the video to Standard PCI VGA, restarting, reinstalling, nothing worked. Now I was getting a message about a ‘Windows protection error’ and the system halted.
I could then restart into “safe mode” and restart with the video in standard PCI VGA configuration. But, I could not change the video settings and have them take effect without restarting the system. When I did restart the system, I would get the same protection error and have to go through the same process to get into the OS.
So now I start to suspect other components. For some reason I was sure it was hardware related. The next two suspects were the hard drive and the CD-ROM drives.
I begin thinking that the Win98 OS was somehow not installed correctly – made sense at the time! I did a reinstall, figuring there were files that were somehow missing or corrupted. Didn’t help. Then I started wondering about the Plextor CD-RW drive that I had been using for most of the software installs.
At this point I was trying things that didn’t necessarily make sense. How could the OS install go completely without incident but leave any of the copied files corrupted? I also had a refurbed Compaq pull CD-ROM drive installed. So I tried switching the cables so that the regular CD-ROM drive was the install drive. No change! But I did notice it was much faster than the Plextor! That’s odd!
Next up, the hard drive! For reference’s sake, let me give names to the three hard drives involved. They are all Western Digital. The 10 gig that the system was running with prior to the problems I’ll call the “safe drive”, since I tried to keep it protected throughout this whole process. Then there is a 20 gig drive that was used as a periodic backup, called the “big guy”, and a third drive of 1.2 gig capacity called “Little Mo”. Little Mo is old and only used when there are problems since it is not large enough to hold all of the installed software.
First, I did a copy of the entire safe drive onto the big guy so that the two would be identical and I could keep the safe drive “safe”. This is done with Western Digital DataLifeGuard Tools, which makes an exact image from one drive to another. Now I isolate the safe drive and reinstall the OS onto the big guy. As I suspected, nothing changed.
Next, I format the big guy and do a reinstall of the OS. I am still getting the protection error on restart. OK, so I run the diagnostic tools on the drive and get an error code of “0202”. I check the WD website and there is no description of this code or the problems associated with it.
Next, I email WD support and ask if it needs to be RMA’d. By the way, let me say here that I received excellent responses and help from Western Digital and at this point would not consider another brand.
The first response was from Theresa and is as follows:
“I recommend that you write zeroes to the drive to wipe it clean of all data so you may proceed to partition and format the drive with the drive back to a brand new state. “Writing Zeroes” to the drive will completely erase all information (both system and data files) from the drive”.
Of course Theresa went on to explain that the process would destroy all data on the drive and the procedures to follow.
During this time I set up Little Mo and installed the OS and a minimum of other software, you know, just to keep things going..hehe. To my surprise, it worked with the ATI video card where the big guy would not. That is really weird! I essentially went through the same process with Little Mo of formatting the drive prior to the install but NOT writing zeros.
So, before I ran the “writing zeros” to the big guy, I did an image from Little Mo to the big guy so they would be identical in content. Now, you would expect that if it was a hardware problem, there would be no change, but it worked fine. It worked just like Little Mo did! Back to the drawing board!
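Looking back at my notes, the reasoning at this point can be laid out as plain data. What follows is only a sketch in Python, not something I ran at the time. The labels are my own nicknames from this story and the function name is made up for illustration. The idea: if the very same hardware both fails and works depending on what is on the drive, the hardware is probably not the culprit.

```python
# Each observation: (hard drive, video card, what was on the drive, booted OK?)
# The labels are informal names from this story, not output from any tool.
observations = [
    ("big guy", "ATI", "fresh Win98 install", False),
    ("Little Mo", "ATI", "fresh Win98 install", True),
    ("big guy", "ATI", "image of Little Mo", True),
]


def hardware_with_mixed_results(obs):
    """Return hardware combos that both failed and worked at different times."""
    results = {}
    for drive, video, _software, ok in obs:
        results.setdefault((drive, video), set()).add(ok)
    return [hw for hw, seen in results.items() if seen == {True, False}]


mixed = hardware_with_mixed_results(observations)
for drive, video in mixed:
    print(f"{drive} + {video}: same hardware, different results -> look at the software")
```

It proves nothing by itself, but writing the combinations down this way would have pointed me toward software a lot sooner.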
Then I wrote back to WD tech support, told them about the image copy working and got this message back from Jason:
“Thank you for contacting Western Digital.
A format or FDISK will not wipe the drive for you in case there is corrupt data. The Write Zeros will clean the drive for you. Please keep us updated on the status and we will be glad to provide you with additional support if needed.”
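Jason’s point can be pictured with a toy model. This is only a sketch in Python, on the assumption that a format mostly rewrites the bookkeeping area at the front of a partition and leaves the rest of the drive alone. The sizes and function names are invented for illustration, and it says nothing about how DataLifeGuard Tools actually works inside.

```python
# Toy model only: a "drive" is a bytearray with a small table area up front
# and a data area behind it. Real drives and file systems are more involved.
TABLE_SIZE = 16  # hypothetical size of the bookkeeping area
DISK_SIZE = 64   # hypothetical total size


def make_used_disk():
    """Build a 'drive' that already holds old bookkeeping and old data."""
    disk = bytearray(DISK_SIZE)
    disk[:TABLE_SIZE] = b"T" * TABLE_SIZE                # old bookkeeping entries
    disk[TABLE_SIZE:] = b"X" * (DISK_SIZE - TABLE_SIZE)  # old, possibly corrupt, data
    return disk


def quick_format(disk):
    """Reset only the table area; the data area is left as it was."""
    disk[:TABLE_SIZE] = bytes(TABLE_SIZE)


def write_zeros(disk):
    """Overwrite every byte, table and data alike."""
    disk[:] = bytes(len(disk))


formatted = make_used_disk()
quick_format(formatted)

zeroed = make_used_disk()
write_zeros(zeroed)

print("old data bytes left after format:", formatted.count(b"X"))
print("old data bytes left after zeros: ", zeroed.count(b"X"))
```

In this model the format leaves every old data byte in place, while writing zeros leaves none. It also hints at why the real thing takes so long: every part of the drive has to be written.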
When the big guy worked after the image copy from Little Mo, I knew it wasn’t a hardware problem. You may not agree and you may be able to see something that I could not, but it sure did look like a hardware fault up till then. However, just to complete the processes and leave less to chance, I did the “Writing Zeros” to the big guy.
It took 1 hour and 20 minutes and another 20 minutes to run a scan. DataLifeGuard tools then reported a code of 0000, which means that all is well with the drive. The only other issue that persisted was that when initializing the WD DataLifeGuard Tools, it reported a warning message stating that the drive parameters were incorrect. But, the program ran fine and the diagnostics finished and reported. I also communicated this to WD and at this writing, I have not received a response.
After the zeros were written, I tried to load the OS from the CD-ROM and, as expected, it would not load because the hard drive had not been formatted. Did that, loaded the OS and got all the way to loading Windows for the first time. Now I loaded the ATI software and restarted. Guess what appeared again? The dreaded “Windows protection error”. I was hoping it would work, but deep down I knew it wouldn’t!
At this point, I am still lost except that now I am sure it is a software problem. I then spent several evenings searching the internet for the causes of this error. Yeah, I know what you are saying. “Why didn’t he do that first?” Stubbornness IS one of my strong merits.
Several places indicated that there are dozens of conditions that can cause this error message to appear. One thing that came up a number of times was .vxd drivers, of which there are many. The mainboard is an Asus A7V133, and one of the things you install for it is the VIA 4-in-1 driver package. I knew that package included such drivers, so that is where I focused my attention.
I took an educated guess and decided to work on the AGP VxD Drivers. I reinstalled only that part of the driver set and guessed again at using them in the Turbo Mode. Well, I got lucky. After this reinstall, I bypassed the restart and went to the display settings.
Here I set the screen resolution to 800×600 and let it restart. The error did not reappear. Now I scramble to document all of this as best I can remember. No, at this time I do NOT want to try to recreate the problem and do better documentation!
Since getting most of the other drivers and software loaded and set up, I have not had the same error message return. But all is not completely well. As I was writing this article, I noticed the hard drive LED being lit, so I quickly saved my work and within seconds the system rebooted by itself. Temp was at 40C, no noises and no other programs running.
Great, more poltergeists! I did save my work here and am now going to shut down and remove one of the two DIMM modules. This reboot problem is going to take a long time to figure out and it is easy enough to run on 256 meg of memory and watch to see if that may be the culprit.
After realizing that the problem was in the VxD drivers, I wanted to try to restore the use of my original setup which was still on the 10 gig drive, the safe drive. Since this drive had the bulk of my software installed, I kept it protected during this process.
After isolating the other drive, I rebooted with the safe drive as the BOOT drive. As expected, I got the “Windows protection error” and the system halted. I then pressed the reset button and booted into SAFE MODE. The ATI video card was listed in the driver section of the display settings, so I re-installed the VIA 4-in-1 drivers and let it restart.
Got the same error and system halt, again as expected. Redid the safe mode and restart and this time I re-installed the ATI drivers and restarted, but with the same result. It’s getting interesting now!!! Next, I decided that either the ATI software or the VIA software is overwriting the other, but which one?
So, my guess was that I needed to install BOTH, but WITHOUT restarting.
So, I re-installed the VIA drivers and answered “NO” to the restart question. Then I decided that the software for the ATI was probably still there and intact so I went to the display settings and changed the resolution to 800×600. Note that the video driver still said the ATI card was active and the color settings were at minimum. After restarting, the error message was gone and the ATI “Getting Started..” screen came up and I was able to make all other adjustments without problem. The system is running fine and, except for that darn reset on its own, is going strong.
Still, I am unable to determine what caused this problem with the VxD drivers. The system was working at shutdown but then refused to work at the next bootup! Where did the problem lie? Who knows. I may never figure it out. But, I’ll know what to do if there IS a next time!
To all of you who have suffered through this long and drawn-out story, I thank you! If you have ideas or answers to the WHYs and HOWs of this whole mess, please let me know. I’ll entertain just about any suggestions and comments. I plan to contact VIA and ATI about this and try to resolve some of the questions it has raised.
I know there have been many issues with the VIA drivers, AMD-based boards and video card compatibility, but I cannot remember such a situation being mentioned before. If you know of one, please let me know.
My hope in writing this is that it may prove useful to others in similar situations. The cause of the problem was not immediately evident. Going from ‘no bootup’ to what seemed like a video problem, then a CD-ROM problem, then a mainboard and/or BIOS problem made it very frustrating.
And ya know, as we put more and more ‘stuff’ in these things, it will likely get harder and harder to find such problems!