The Single Point of Failure

The SolarWinds Orion platform is used across every industry to monitor systems of all shapes and sizes. Those familiar with Orion have probably seen it used to monitor network devices for performance and up/down status, servers and applications for similar metrics, and a wide range of other functions such as flow monitoring, configuration backup, and change detection. Of all Orion's capabilities, one of the least discussed, yet one used in some form in almost every environment, is the ability to monitor points of failure within the network.

This takes many forms, including the Groups and Dependency functionality and the ability to monitor High Availability (HA) status for things like Cisco ASAs or F5 Load Balancers. Regardless of how it is implemented or alerted on, the common theme is that organizations are keen to track the status of critical points of failure within their environment. Despite this emphasis on monitoring points of failure within the network, there is an alarming trend among many organizations: relying on a single point of failure for their monitoring system.

In the case of Orion, this is frequently a two-fold problem.

1. Reliance on a single application server

The first problem is the reliance on a single Application Server. While some organizations can use vMotion to move virtual servers from one host to another in the event of failure, too many are left in a situation where a hardware failure would result in the loss of monitoring across the board or, at the very least, the loss of the monitoring provided by an additional polling engine.

SolarWinds offers a High Availability solution for primary and additional polling engines that keeps downtime of the monitoring system to a minimum in the event of hardware failure, and it is significantly cheaper than standing up a complete secondary Orion environment. Unfortunately, despite the availability and value of this product, the most common answer from Orion admins when asked about Orion High Availability is, “What’s that?”

2. Neglect of the SQL Database Server

The second major issue is less visible but possibly even more important: the SQL Database Server. SQL Server has a range of high availability solutions (such as Always On) and backup options that can mitigate or prevent downtime in the event of hardware failure, but all too often the Orion SQL Server is set up during the initial deployment and then forgotten about until someone notices poor performance in the Orion application. Even then, the focus tends to be on the health of the database itself, and little attention is paid to the backup and high availability options, despite the fact that a failure of the SQL Database Server will cripple the entire monitoring system.
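As a rough illustration of how a forgotten database server could be caught early, the sketch below flags databases whose most recent backup is missing or older than a chosen threshold. The database names, timestamps, and 24-hour threshold are hypothetical; in practice the timestamps might be read from SQL Server's backup history in msdb.dbo.backupset, which this example does not attempt to do.

```python
from datetime import datetime, timedelta


def stale_backups(last_backups, now, max_age=timedelta(hours=24)):
    """Return the names of databases with no backup or a backup older than max_age."""
    stale = []
    for name, finished in last_backups.items():
        # None means no backup has ever been recorded for this database.
        if finished is None or now - finished > max_age:
            stale.append(name)
    return sorted(stale)


# Hypothetical sample data: names and times are assumptions for illustration only.
now = datetime(2024, 1, 10, 12, 0)
last_backups = {
    "SolarWindsOrion": datetime(2024, 1, 3, 2, 0),  # a week old
    "SolarWindsOrionLog": None,                     # never backed up
    "master": datetime(2024, 1, 10, 2, 0),          # backed up last night
}

print(stale_backups(last_backups, now))
```

A check like this only shows that backups exist and are recent. It says nothing about whether they can be restored, so periodic restore tests are still worth scheduling.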

Mitigate the single points of failure

It is absolutely critical that Orion administrators take care to mitigate the single points of failure inherent in their monitoring system. Failure to do so could result in the loss of the very platform they rely on to monitor these same issues in other systems throughout their environment.
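One simple way to start is to write down which hosts back each role in the monitoring system and look for roles with only one. The sketch below does exactly that; the role names and host names are hypothetical, and a role listed with two hosts is only a candidate for redundancy, since the inventory alone cannot confirm that failover is actually configured.

```python
def single_points_of_failure(inventory):
    """Return roles that are backed by fewer than two distinct hosts."""
    return sorted(
        role for role, hosts in inventory.items() if len(set(hosts)) < 2
    )


# Hypothetical inventory of a monitoring deployment (all names are assumptions).
inventory = {
    "primary polling engine": ["orion-app-01"],
    "additional polling engine": ["orion-ape-01", "orion-ape-02"],
    "sql database": ["orion-sql-01"],
}

print(single_points_of_failure(inventory))
```

In this made-up inventory, the primary polling engine and the SQL database are each backed by a single host, which matches the two problems described above.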

Invest in high-availability solutions for your monitoring system so a catastrophic event doesn’t kill the tool you rely on to alert you to the incident.