Servers, like home furnaces, are critical pieces of infrastructure that should work without issue for many years. But eventually they will start struggling, maybe rattling a bit or not working as efficiently as they once did. They might limp along for a while, but one sad day, they go dark. Maybe it’s a quick component fix, or maybe you’ll have to replace the whole dang thing.

If you’re unable to get things back up and running quickly, you could be looking at expensive downtime while you try to find a replacement part or new system… all the while suffering the consequences. Don’t let this happen to you! Let’s take a look at signs a server is about to fail as well as some server life cycle basics.

So, how long can you reasonably expect a server to hold up? It varies, but there are several rules of thumb:

  • The average lifespan of a server is roughly five years, but the recent trend has been toward shorter life cycles. Aging servers, in addition to having increased risk of failure, are less powerful and less energy-efficient than newer models. Additionally, individual components within servers are likely to fail before the whole thing needs to be replaced.
  • Servers used in production on mission-critical apps are only as good as their warranties, which typically run three to five years. If they fail before then, you know the manufacturer will have your back in the form of service and replacement parts, but keeping servers online after that is a gamble you don’t want to take.
  • The cost of maintenance on aging machines can take its toll. If you don’t have the cash to replace hardware or if you’re forced to use older infrastructure, the costs can add up. And if you experience an extended failure, the downtime can have a serious effect on your company’s bottom line. Therefore, early problem diagnosis is important!

So what are the red flags that might pop up before a server crash wrecks your plans for the weekend and drains your company’s bank account? Here’s what to look out for:

1) Temperature troubles: CPU running hot

Like a human being, a server might be in trouble when it starts running a fever. According to one vendor, every increase of 18°F (10°C) above 68°F (20°C) reduces reliability by around 50 percent.
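To put rough numbers on that rule of thumb, here is a minimal sketch that treats it as a compounding halving: each full 18°F above the 68°F baseline cuts reliability in half again. That compounding reading is our own interpretation of the vendor's figure, and the function name and parameters are purely illustrative.

```python
def relative_reliability(temp_f, baseline_f=68.0, step_f=18.0):
    """Estimate reliability relative to the 68°F baseline.

    Assumption (our interpretation of the rule of thumb): every full
    step_f degrees above the baseline halves reliability, and
    temperatures at or below the baseline carry no penalty.
    """
    excess = max(0.0, temp_f - baseline_f)
    return 0.5 ** (excess / step_f)


for temp in (68, 86, 104):
    share = relative_reliability(temp)
    print(f"{temp}°F -> {share:.0%} of baseline reliability")
```

Under this reading, a server room at 104°F would be running at roughly a quarter of its baseline reliability.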

However, like a fever, the high temperature itself might not be the real problem but a symptom of an underlying one (e.g., issues with the power supply, memory, etc.). Therefore, you should check the CPU, chipset, and HDD temperatures, and verify that your fans are running properly.

If you can’t immediately determine the cause of excess heat, keep looking. Other possible causes of high temperature could include a clogged front intake, blockage of the exhaust or airflow, recent re-positioning of the machine, or a dirty heat sink.

Note: to figure out if your server is running too hot, you can probably check with your vendor for baselines; many models come with acceptable temperature operating specifications.
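If your server runs Linux, one way to compare readings against a vendor baseline is sketched below. It assumes the kernel exposes sensors under /sys/class/thermal, where each thermal_zone*/temp file holds a value in thousandths of a degree Celsius. Not every machine exposes these files, they typically do not cover hard drives, and the 70°C limit is a placeholder rather than a recommendation. The function names are our own.

```python
from pathlib import Path


def read_thermal_zones(base="/sys/class/thermal"):
    """Return {zone_name: temperature in °C} for each readable thermal zone."""
    readings = {}
    for temp_file in sorted(Path(base).glob("thermal_zone*/temp")):
        try:
            millidegrees = int(temp_file.read_text().strip())
        except (OSError, ValueError):
            continue  # zone not readable on this machine; skip it
        readings[temp_file.parent.name] = millidegrees / 1000.0
    return readings


def zones_over_limit(readings, limit_c):
    """Return only the zones whose temperature exceeds limit_c."""
    return {name: temp for name, temp in readings.items() if temp > limit_c}


if __name__ == "__main__":
    MAX_TEMP_C = 70.0  # placeholder; substitute your vendor's documented limit
    hot = zones_over_limit(read_thermal_zones(), MAX_TEMP_C)
    for name, temp in hot.items():
        print(f"WARNING: {name} is at {temp:.1f}°C (limit {MAX_TEMP_C}°C)")
```

Keeping the threshold check separate from the sensor reading makes it easy to swap in readings from another source, such as your vendor's management tools.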

2) Constant reboots or random failures

Even a “healthy” server can give out if put under unusually excessive load (same goes for IT pros). Such failures in isolation are usually nothing to worry about. But a mysterious crash for no clear reason, on a server with no intensive process running on it? Cause for concern. Don’t just reboot and pray for the best. It’s time for a little CSI: server action.

  • Pore over event logs to see if you can find any explanation for the odd behavior
  • Physically check the motherboard to see if any components (such as capacitors in the power supply) are damaged
  • Run a memory test and reseat the memory sticks
  • Check the server’s disk for errors
  • Use antivirus/anti-malware software to see if an infection or intrusion might be causing the crashes
  • Make sure that the server isn’t being put under undue stress (for example, you can use network monitoring software to alert you of high CPU, memory, or disk utilization)
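As a rough illustration of the last check, the sketch below gathers a couple of utilization figures with Python's standard library and compares them against thresholds. The threshold values and function names are hypothetical, and memory utilization is left out because the standard library has no portable call for it; dedicated monitoring software will cover far more than this.

```python
import os
import shutil


def collect_readings(path="/"):
    """Gather a few utilization figures using only the standard library."""
    usage = shutil.disk_usage(path)
    readings = {"disk_used_pct": 100.0 * usage.used / usage.total}
    try:
        load_1min, _, _ = os.getloadavg()  # Unix only
        readings["load_per_cpu"] = load_1min / (os.cpu_count() or 1)
    except (AttributeError, OSError):
        pass  # load average is unavailable on this platform
    return readings


def find_alerts(readings, thresholds):
    """Return a message for every reading above its threshold."""
    return [
        f"{name} is {value:.1f} (threshold {thresholds[name]:.1f})"
        for name, value in readings.items()
        if name in thresholds and value > thresholds[name]
    ]


# Hypothetical thresholds; tune them to your own workload.
THRESHOLDS = {"disk_used_pct": 90.0, "load_per_cpu": 1.5}

if __name__ == "__main__":
    for alert in find_alerts(collect_readings(), THRESHOLDS):
        print("ALERT:", alert)
```

Run on a schedule, a check like this can flag sustained stress before it turns into a crash.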
 

3) Computer hanging up and services failing

“My computer’s running slow!” is undoubtedly one of the most popular help desk ticket subject lines of all time, and the cause could be almost anything. With a server though, sudden slowness is often the result of deep-seated problems that could put it at risk for failure.

For example, a process may cause a memory leak that could eat up all of your system resources, which could result in the system grinding to a halt. A simple software update might fix things in these instances, but your system may crash for other reasons. For example, your Linux server might decide to go read-only if your hard drive is acting up. Or data corruption might be causing applications to randomly fail. Over time, tiny problems will start to add up, and if regular maintenance isn’t enough to keep your server in working order, it may be time for a replacement.
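To spot the read-only case on Linux, one option is to scan /proc/mounts, which lists each mounted filesystem along with its mount options. The sketch below flags disk filesystems whose options include "ro". The set of filesystem types is an assumption you would adjust for your own servers, the function name is ours, and a read-only mount can also be intentional, so treat a hit as a prompt to investigate rather than proof of a failing drive.

```python
from pathlib import Path

DISK_FILESYSTEMS = {"ext3", "ext4", "xfs", "btrfs"}  # assumed set; adjust as needed


def read_only_mounts(mounts_text, fs_types=DISK_FILESYSTEMS):
    """Return mount points of disk filesystems currently mounted read-only.

    mounts_text is the content of /proc/mounts, where each line has the form:
    device mount_point fs_type options dump pass
    """
    flagged = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fs_type, options = fields[:4]
        if fs_type in fs_types and "ro" in options.split(","):
            flagged.append(mount_point)
    return flagged


if __name__ == "__main__":
    mounts = Path("/proc/mounts")
    if mounts.exists():
        for mount_point in read_only_mounts(mounts.read_text()):
            print(f"WARNING: {mount_point} is mounted read-only")
```

Filtering by filesystem type keeps pseudo-filesystems that are normally mounted read-only from triggering false alarms.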

Really slow data transfer rates are a huge bottleneck and a big red flag for hard drive problems, as is a rising number of bad sectors that don’t respond to read/write operations. Strange noises (for HDDs) are also a warning sign, much like the hypothetical noisy furnace we mentioned earlier.
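If you already record a bad-sector count at regular intervals (for example, from your drive's SMART data), even a very simple trend check can raise the alarm. The sketch below is a minimal illustration with made-up sample numbers; the function name and the default threshold are our own, and how you collect the counts is outside its scope.

```python
def bad_sectors_rising(counts, min_increase=1):
    """Return True if the newest count exceeds the oldest by min_increase or more.

    counts is a chronological list of bad-sector counts, oldest first.
    """
    if len(counts) < 2:
        return False  # not enough history to judge a trend
    return counts[-1] - counts[0] >= min_increase


weekly_counts = [0, 0, 2, 5, 11]  # made-up sample data for illustration
print(bad_sectors_rising(weekly_counts))  # prints True
```

A stable nonzero count is less worrying than one that keeps climbing, which is why the check compares readings over time instead of looking at a single value.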