Revision as of 16:29, 24 May 2008
- 1 Hardware Problems
- 1.1 Use your Hardware within spec as defined by Vendors
- 1.2 Memory
- 1.3 BIOS
- 1.4 Don't fool yourself with faulty or incorrectly used hardware
- 1.4.1 Check cabling is secure.
- 1.4.2 Check your power.
- 1.4.3 Balance the power cables out.
- 1.4.4 Check cabling isn't obscuring airflow.
- 1.4.5 USB devices plugged into hubs.
- 1.4.6 Cable convertors.
- 1.4.7 Make sure you have the right cables.
- 1.4.8 Make sure your cables have no loose contact.
- 1.4.9 Put slow IDE devices on their own bus.
- 1.4.10 Check your harddisk for errors.
Hardware Problems
A number of bugs get reported that really don't make a lot of sense. They cause all sorts of head-scratching among kernel developers. Whilst most bug-reporters don't like to hear that their shiny new hardware may be broken/crap, sadly this is sometimes the case. Here are a few tips that may help root-cause hardware problems.
Note: Screensavers, especially the 3D OpenGL ones, might cause strange behavior. Try disabling or removing them to verify this.
Use your Hardware within spec as defined by Vendors
Use the CPU clock speed that the CPU was rated for.
Don't overclock. Don't overclock. Don't overclock. Even if alternative operating systems like Windows work fine on your 6GHz water-cooled Pentium 4, this in itself is no guarantee that your hardware is stable. In some cases Linux can push the theoretical limits of the machine (by utilising all available memory bandwidth for sustained periods of time, for example). Under such extreme load, the CPU will generate a lot more heat than it will sitting idle on other operating systems, thus exposing your hardware problem under Linux in a more evident fashion.
Be sure to use adequate cooling.
In some cases, it has been discovered that even the CPU coolers that come with some OEM systems aren't quite as good as some 3rd party coolers. It's worth spending a little extra to ensure your CPU stays within its temperature limits whilst under sustained heavy load. Also, not all thermal grease is equivalent -- some of the more expensive options really do get things running cooler. Be sparing when applying this stuff, btw: too much is a bad thing and adversely affects its ability to dissipate heat.
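One way to keep an eye on temperatures while the machine is under load is to poll the kernel's thermal zones. The following is a minimal sketch, assuming a kernel that exposes `/sys/class/thermal/thermal_zone*/temp` in millidegrees Celsius; not every machine or kernel provides these files, and the function names are made up for this example.

```python
from pathlib import Path


def millidegrees_to_celsius(raw: str) -> float:
    """Convert a sysfs thermal reading (millidegrees Celsius) to degrees."""
    return int(raw.strip()) / 1000.0


def read_thermal_zones(base: str = "/sys/class/thermal") -> dict:
    """Return {zone label: temperature in C} for every readable thermal zone.

    Assumption: zones live under `base` as thermal_zone0, thermal_zone1, ...
    each with a `type` and a `temp` file. Unreadable zones are skipped.
    """
    readings = {}
    for zone in sorted(Path(base).glob("thermal_zone*")):
        try:
            kind = (zone / "type").read_text().strip()
            temp = millidegrees_to_celsius((zone / "temp").read_text())
        except (OSError, ValueError):
            continue  # zone not readable on this machine; skip it
        readings[f"{zone.name} ({kind})"] = temp
    return readings


if __name__ == "__main__":
    # Prints nothing at all if the machine exposes no thermal zones.
    for name, temp in read_thermal_zones().items():
        print(f"{name}: {temp:.1f} C")
```

Run it a few times during a big compile and compare the numbers against the limits your CPU vendor documents.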
Spring clean your cooler(s).
Periodically take your system apart and clean out the dust and buildup from any parts with fans, including graphics boards.
Make sure your power supply is adequate to power all the peripherals you have attached.
The gotcha here is that whilst it may be adequate to get the OS booted, when it's actually doing some work (like a big compile, or running doom), it's going to use up more power than it would whilst idling. All of this power has to come from somewhere. If the PSU can't supply enough, something is going to be underpowered, which can result in very strange kernel panics.
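The arithmetic behind this is simple enough to sketch. The wattage figures below are hypothetical round numbers chosen for illustration, not measurements, and the 20% safety margin is an assumption rather than a vendor recommendation; look up the real peak figures for your own parts.

```python
def psu_headroom(psu_watts: int, peak_loads: dict, margin_percent: int = 20) -> int:
    """Return spare watts once peak loads and a safety margin are accounted for.

    A negative result means the PSU is undersized for the worst case.
    margin_percent is an assumed safety reserve, not a vendor figure.
    """
    usable = psu_watts * (100 - margin_percent) // 100
    return usable - sum(peak_loads.values())


# Hypothetical peak-draw figures (watts), for illustration only.
example_loads = {
    "cpu": 95,
    "graphics card": 150,
    "motherboard and RAM": 50,
    "two hard disks": 20,
    "optical drive": 25,
    "fans": 10,
}

if __name__ == "__main__":
    print(psu_headroom(400, example_loads))
```

With these made-up numbers a 400 W supply comes out 30 W short under full load, even though the same machine would idle (and boot) quite happily.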
Memory
Test your RAM
Use Memtest86. Yes, it takes ages to run. Sometimes it takes at least a day before it shows up that there's a bit error in some DIMM. (The worst I've seen was an error that only showed up after a week-long run.) It's really worth the time testing, though. If you don't do this test and the problem really is flaky RAM, then the 'bug' will never be fixed and will just cause extensive head-scratching.
The easiest way to start this test is to boot a Fedora installation CD. At the prompt, type "memtest86".
Use Memory that is tested for your Board/System.
Most mainboard vendors publish a list of tested modules. Buy those.
Don't run the 2-year-old modules from your old board.
Sometimes older modules don't work correctly in newer systems.
Where possible avoid double-sided RAM.
Some boards don't seem as well tested with double-sided sticks as they are with single-sided sticks. In one reported case, double-sided RAM only worked in one of the two slots. Sometimes it also needs to be run slower than specified.
Avoid mixing modules with different size/organisation
Some boards don't like it.
Check RAM timings in BIOS.
Mixing and matching DIMMs of different speeds is a sure-fire way to introduce really bizarre bugs and stability problems. Also make sure that the speed settings in the BIOS are either set to AUTO (sometimes labelled 'SPD' [serial presence detect]) or have the correct speed for the DIMMs. As noted above, double-sided DIMMs can be problematic. Don't mix and match RAM where possible.
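If you're not sure what is actually installed, the output of `dmidecode -t memory` (run as root) lists each slot. Below is a rough sketch that scans that text for mixed speeds. It assumes each slot appears as a blank-line-separated "Memory Device" block with `Size:` and `Speed:` fields; the exact fields vary between BIOSes and dmidecode versions, so treat the parsing as an assumption. The function names are invented for this example.

```python
import re
from collections import Counter


def installed_dimm_speeds(dmidecode_text: str) -> Counter:
    """Count populated memory devices by their reported 'Speed:' value."""
    speeds = Counter()
    # Assumption: each slot is described in its own blank-line-separated block.
    for block in re.split(r"\n\s*\n", dmidecode_text):
        if "Memory Device" not in block:
            continue
        size = re.search(r"^[ \t]*Size:[ \t]*(.+)$", block, re.MULTILINE)
        speed = re.search(r"^[ \t]*Speed:[ \t]*(.+)$", block, re.MULTILINE)
        if not size or not speed:
            continue
        if size.group(1).startswith("No Module") or speed.group(1).strip() == "Unknown":
            continue  # empty slot, nothing to compare
        speeds[speed.group(1).strip()] += 1
    return speeds


def has_mixed_speeds(dmidecode_text: str) -> bool:
    """True if the populated slots report more than one distinct speed."""
    return len(installed_dimm_speeds(dmidecode_text)) > 1
```

Save the dmidecode output to a file and pass its contents to `has_mixed_speeds()`. A `True` result is a hint to check the BIOS settings or swap modules, not proof of a fault.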
BIOS
Reset BIOS to defaults.
Don't try to be smarter than your hardware vendor. In most cases the default options are sufficient. If not:
Reset BIOS to safe defaults.
A number of times, users have reported issues that manifest themselves as really obscure oopses that don't make a lot of sense. They turned out to be things like 'CAS timing' set too aggressively on systems with cheap RAM. (A number of times, these settings worked fine until the user added an extra DIMM.) Interestingly, this problem didn't show up under memtest86 [although maybe it would have if left to run long enough].
Be sure to be running the latest BIOS from your motherboard manufacturer.
Modern Linux kernels make heavy use of ACPI during bootup for things ranging from hardware discovery to interrupt routing and power management. From time to time, broken tables in the BIOS are discovered, which can cause crashes or lockups before the system even boots. Even non-ACPI related tables are wrong from time to time, so this is one to check, even if you don't use ACPI.
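To compare what you're running against what the manufacturer currently offers, you first need to know your installed BIOS version. Here is a small sketch that reads it from sysfs, assuming a kernel that exports DMI data under `/sys/class/dmi/id` (older kernels may not); the function name is invented for this example.

```python
from pathlib import Path

# DMI attributes that are normally readable without root.
DMI_FIELDS = ("bios_vendor", "bios_version", "bios_date", "board_vendor", "board_name")


def read_dmi_info(base: str = "/sys/class/dmi/id") -> dict:
    """Return the BIOS and board identification strings, or None where unavailable."""
    info = {}
    for field in DMI_FIELDS:
        try:
            info[field] = (Path(base) / field).read_text().strip()
        except OSError:
            info[field] = None  # attribute missing or unreadable on this system
    return info


if __name__ == "__main__":
    for field, value in read_dmi_info().items():
        print(f"{field}: {value}")
```

Including this information in a bug report also saves a round trip with whoever ends up triaging it.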
Just say no.
Don't fool yourself with faulty or incorrectly used hardware
Check cabling is secure.
We've seen various strange reports, ranging from "floppy occasionally writes/reads bad sectors" and "hard disk spews random error messages" to "graphics get corrupted", which turned out to be caused by power cables that hadn't been fully pushed home. If you use Y-shaped 'splitter' cables and you see problems on one specific device, try a different cable. Sometimes these things aren't built quite as well as the connectors coming from the PSU.
Check your power.
There have been a number of bugs reported that 'disappeared' when the reporter moved the affected machine onto a UPS or an alternative power circuit. There are a number of surge protector type devices that 'clean' the incoming power before it reaches the computer. They make for a valuable investment.
Balance the power cables out.
Don't have multiple Y cables with a dozen devices coming off a single PSU spur. Use 1 Y cable per spur where possible. (If you need more, it may be a sign that you need a larger PSU).
Check cabling isn't obscuring airflow.
Fans should be completely unobstructed, so that air can flow in at the lower front of the case and be blown out by the fans and power supply at the top of the back.
USB devices plugged into hubs.
Not all hubs are powered; unpowered hubs draw their power from the computer instead of having their own power supply. If you plug devices into one of these hubs that together draw more power than the hub can get from the computer, you may observe strange USB failures. Choose a powered hub if you think you will be using devices that may draw lots of power (printers, scanners, etc.).
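The power budget can be checked with a bit of arithmetic. The sketch below assumes USB 2.0 rules, where a bus-powered hub gets at most 500 mA from its upstream port. It also assumes each device's declared maximum draw is available as a string like `500mA`, which is how Linux reports it in the `bMaxPower` sysfs attribute of a USB device. The 100 mA reserved for the hub itself is a guess for illustration, and the function names are made up.

```python
import re


def parse_max_power(text: str) -> int:
    """Parse a declared maximum draw such as '500mA' into milliamps."""
    match = re.fullmatch(r"\s*(\d+)\s*mA\s*", text)
    if match is None:
        raise ValueError(f"unrecognised max power value: {text!r}")
    return int(match.group(1))


def hub_overloaded(device_draws_ma, upstream_budget_ma: int = 500,
                   hub_own_draw_ma: int = 100) -> bool:
    """True if the devices plus the hub itself want more than upstream can supply.

    hub_own_draw_ma is an assumed figure; real hubs declare their own value.
    """
    return sum(device_draws_ma) + hub_own_draw_ma > upstream_budget_ma
```

For example, `hub_overloaded([parse_max_power("500mA"), parse_max_power("100mA")])` flags a scanner plus a mouse on an unpowered hub, while three 100 mA devices stay within budget.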
Cable convertors.
Attaching a device to an interface through a dozen cable convertors and gender changers is going to weaken the signal, which may result in errors. Make sure you buy the right cables rather than hacking together a frankencable.
Make sure you have the right cables.
An 80-conductor (80-wire) IDE cable is essential for anything above ATA33, for example; the connector itself still has 40 pins. Using a 40-conductor cable might work at low speeds, but you'll get DMA errors later when you try to do faster transfers.
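As a quick reference, the 80-wire cable requirement can be written down as a lookup on the Ultra DMA mode. The sketch below uses the nominal transfer rates for UDMA modes 0 to 6; the helper name is invented for this example.

```python
# Nominal maximum transfer rate (MB/s) for each Ultra DMA mode.
UDMA_MODE_MBPS = {0: 16.7, 1: 25.0, 2: 33.3, 3: 44.4, 4: 66.7, 5: 100.0, 6: 133.0}


def needs_80_wire_cable(udma_mode: int) -> bool:
    """True for modes faster than UDMA 2 (ATA33), which require the 80-wire cable."""
    if udma_mode not in UDMA_MODE_MBPS:
        raise ValueError(f"unknown UDMA mode: {udma_mode}")
    return udma_mode > 2
```

So a drive negotiating UDMA mode 5 (ATA100) on an old 40-wire cable is a likely candidate for DMA errors, while a CD drive at mode 2 is fine on either cable.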
Make sure your cables have no loose contact.
IDE cables in particular weaken after being plugged in and out several times. Try other cables if you see DMA problems.
Put slow IDE devices on their own bus.
Typically these are CD drives and ATA floppy drives. These things aren't particularly fast, and when used as slaves on the same bus as a hard disk, they can impact hard drive performance and, in some cases, cause corruption. Additionally, if you have both an ATA33 and a higher-speed ATA bus, putting the CD drives etc. on the slower bus is advised.
Check your harddisk for errors.
Your hard disk can have problems you don't notice in normal use. Modern disks try to repair errors as they happen, so use a SMART tool to monitor and diagnose disk problems. The "smartctl" tool can be used as described here: http://fedoraproject.org/wiki/smartctl
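If you want to script a quick health check, something like the following may do. It is only a sketch: it assumes smartmontools is installed, that you run it as root, and that the disk is an ATA device whose `smartctl -H` output contains the usual "overall-health self-assessment test result" line (SCSI disks report differently). The device name `/dev/sda` is just an example, and the function names are invented.

```python
import subprocess


def smart_health_passed(smartctl_output: str):
    """Return True/False from 'smartctl -H' text, or None if no verdict line is found."""
    for line in smartctl_output.splitlines():
        if "overall-health self-assessment test result" in line:
            return line.rsplit(":", 1)[1].strip() == "PASSED"
    return None


def check_disk(device: str = "/dev/sda"):
    """Run 'smartctl -H' on the device (example name) and interpret the verdict."""
    result = subprocess.run(["smartctl", "-H", device],
                            capture_output=True, text=True)
    return smart_health_passed(result.stdout)
```

A `None` result means the verdict line wasn't found, for example because SMART is disabled or the device isn't supported. A disk that passes this check can still be failing, so `smartctl -a` and the drive's error log are worth a look too.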