System Reliability – Does it Matter?

Published 3/8/23

System security and reliability are two sides of the same coin. On one hand, system security ensures that attacks do not breach system integrity and that the system can recover if an attack succeeds. On the other hand, system reliability ensures that the system operates correctly, with an emphasis on the ability of equipment to function without failure. Each depends on the other: the synergy of security and reliability is the cornerstone of a system that runs smoothly for years to come.

With this in mind, you should be using OneDrive, Box, or a similar service to back up and/or store your files. Whatever backup strategy you use, test it regularly to confirm that you can actually restore your files, so that you are protected even against determined hackers.
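One way to test a backup is to restore it to a scratch folder and compare it with the originals. The following is a minimal sketch of that comparison using only the Python standard library; the function names `file_digest` and `verify_restore` are made up for illustration, and it assumes you have already restored the backup to a local folder using your provider's own tools.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_restore(source_dir: Path, restored_dir: Path) -> list[str]:
    """Return relative paths that are missing or different in the restored copy."""
    problems = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        restored = restored_dir / rel
        # A file fails the check if it is absent or its contents differ.
        if not restored.is_file() or file_digest(src) != file_digest(restored):
            problems.append(rel.as_posix())
    return problems
```

An empty list means every file in the source folder came back intact. Anything else is a list of files to investigate before you need the backup for real.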

Next, document all your systems and assign a criticality level to each part of your infrastructure. This will help you determine how to protect them from going offline. Once you have the list, how do you improve their reliability?

Let’s take a web application that you host as an example. You will need redundant internet connections, firewalls, Ethernet switches, and web and database servers. Each server should have redundant power supplies and Ethernet ports. Any single component WILL fail eventually!
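To see why redundancy at every layer pays off, here is a rough back-of-the-envelope sketch. It assumes that components fail independently of each other and that each one is available 99% of the time; both are illustrative assumptions, not measurements, and the helper names `redundant` and `chain` are made up for this example.

```python
from math import prod

HOURS_PER_YEAR = 8760


def redundant(availability: float, copies: int) -> float:
    """Availability of identical components in parallel (any one is enough)."""
    return 1 - (1 - availability) ** copies


def chain(availabilities) -> float:
    """Availability of components in series (every one must be working)."""
    return prod(availabilities)


# Five layers: internet, firewall, switch, web server, database server.
single = chain([0.99] * 5)                    # one of each
paired = chain([redundant(0.99, 2)] * 5)      # two of each

print(f"single: {(1 - single) * HOURS_PER_YEAR:.1f} hours down per year")
print(f"paired: {(1 - paired) * HOURS_PER_YEAR:.1f} hours down per year")
```

Under these made-up figures, the single chain is down for roughly 429 hours a year, while the fully paired chain is down for roughly 4.4 hours. Real failures are not perfectly independent (a power outage can take out both switches), which is one reason the redundancy still needs to be tested.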

You may also need to invest in a redundant data center.  This is especially important if your primary location is in an area that is at a high risk for natural disasters like hurricanes and earthquakes.

If you have important systems in your office environment, you will need to review the redundancy there as well. All of this redundancy should be tested at least annually.

Some outages are caused by employee mistakes. To minimize them, limit the number of employees who have access to critical systems, covering both management access and physical access.

To develop a program to improve the reliability of your systems, track all outages and conduct a Root Cause Analysis (RCA) for each one that identifies the following:

  1. When the issue occurred and how long the system was down
  2. What systems were affected
  3. How the issue was reported (ideally by automated notifications from system monitoring, so you know before a customer calls you)
  4. Why it went down
  5. What can be done to avoid it in the future
  6. What worked and what did not work in the response
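If you prefer to track outages in a script rather than a spreadsheet, the six items above map naturally onto a simple record. The sketch below is one possible layout, not a standard; the class name `Outage`, its field names, and the two helper functions are all invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Outage:
    started: datetime        # 1. when the issue occurred
    restored: datetime       # 1. when service came back
    systems: list[str]       # 2. what systems were affected
    reported_by: str         # 3. e.g. "monitoring" or "customer"
    root_cause: str          # 4. why it went down
    prevention: str          # 5. what can be done to avoid it
    response_notes: str      # 6. what worked and what did not

    @property
    def duration(self) -> timedelta:
        """How long the system was down."""
        return self.restored - self.started


def total_downtime(outages) -> timedelta:
    """Sum the downtime across all recorded outages."""
    return sum((o.duration for o in outages), timedelta())


def monitoring_detection_rate(outages) -> float:
    """Fraction of outages first reported by automated monitoring."""
    if not outages:
        return 0.0
    detected = sum(1 for o in outages if o.reported_by == "monitoring")
    return detected / len(outages)
```

Reviewing `total_downtime` and `monitoring_detection_rate` over time gives a rough sense of whether the program is working: downtime should trend down, and the share of outages caught by monitoring before a customer calls should trend up.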

Create a change management plan that includes:

  • An approval process for changes
  • A communication plan for before, during, and after the change
  • An opportunity to bring all departments together to avoid unintended consequences
  • A post-change roses and thorns meeting to review what went well and what did not, so you get better at the change process

With proper procedures and planning you can build and run more secure and reliable systems!