Part 2 of 2
(Read part 1 here)
I have all this data coming in, now what?
Once you have all your important systems in the monitoring tool, you have logical polling intervals, and everything is categorized and labeled you probably have a default summary page with a ton of red and green Christmas tree lights and severe looking red words all over it.
Your boss walks past the screen and sees red dots and error messages and starts asking why so many things are broken. Now you must explain that this is probably normal stuff and they shouldn’t worry, but what is the point of the monitoring tool if you are supposed to ignore half of what shows up on the screen?
How do you know which half is the important stuff and which can you ignore? This is where you start to tailor the tool to your environment and make it helpful. Those tags we set up earlier will be critical in this regard.
Often the initial summary page should be a basic snapshot of the current availability of the key services that impact nearly everyone in the organization. How do the domain controllers look? Can we still send emails? Are the main business offices and datacenters okay? Is the company website up?
Depending on which monitoring tools you have you may have different methods available to you to validate all these things but generally you want to find a way to simply display that a given service is available, unavailable, and maybe an indicator if things are degraded in some way but still not completely offline.
Keep things simple and high level, if the issue is directly relevant to someone they can drill in deeper.
In SolarWinds I will often build out lists of critical services as groups, and then display the statuses of those groups with a simple map made using the Network Atlas tool. You can get really elaborate with customizing icons and such here but big green, yellow, and red indicators do a perfectly fine job.
Going beyond that high-level service indicator, you might also want to include information about upcoming maintenance windows or changes. A simple custom HTML box with messages you want to get out there would do the job or a custom table with every device that is scheduled to be unmanaged this week.
Are there any significantly congested points on the network that might have wide ranging impacts such as the WAN interfaces? You can add in a filtered resource that just shows the current utilization of these circuits, or if there are many circuits just show any of them that are above their thresholds.
Probably useful to have a search box to help people jump to the specific device they were interested in if they logged in with a mission in mind. I would shy away from having resources that list every event happening in the environment here as there are going to be constant streams of noise and they are likely going to be too scattered to be very helpful without some filtering.
If your environment is larger you may also want to add tabs to this view or links to other dashboards where you split things up based on the support teams or types of monitored objects involved. You find that the layout that makes sense to one team is often not particularly relevant to another.
While the network team could want the environment grouped by Site names, the DBA team might not be as concerned with physical locations if their workloads are all run in the central datacenter. Maybe they would benefit more from sorting their objects out by the type of database on the server, Oracle vs MSSQL, or the environment, Prod vs Dev.
On these more detailed pages you will likely also want to get into displaying more charts to show how things change over time in the environment.
A Network team could benefit from a chart indicating average response time for network devices grouped by Site name over the last 24 hours.
An Applications team might want to see things like the average CPU and memory use of their servers but they ask for a rolling week in order to see day over day changes in the trends and alongside that they want a chart of the application’s active user sessions or data throughput.
Spending time talking to the consumers of this data can give you a lot of insight into what metrics they care about, and what format is most helpful in presenting it to them. Putting a table where you need a chart makes it hard to spot changes over time and charts are unnecessary if all you need is a current status.
Thresholds and Responses
A key element in monitoring is setting your thresholds. How much CPU load is enough to get your admin involved and at what point does slow response to pings warrant investigating?
You will find out of the box thresholds built into whatever tool you are using but you will need to tweak them to the reality of your environment. I typically find that the most effective way to use thresholds is to set my critical threshold to the value where I would expect someone to try to address the issue immediately.
If you know that a server uses a high amount of memory then leaving it with the default threshold of 90 is not efficient since that metric will always show as being critical, which makes the dashboards look like there are more problems than there are and gets the users into the habit of ignoring the red signs.
Similarly, if you have monitors set up on something like a SQL performance counters and your DBA tells you that they do not generally worry about the number of connected sessions then don’t set a critical value for this metric. If something is just nice to know or gives clues about what is going on but isn’t a main indicator, then I don’t want to get messages about it in my inbox.
Alert fatigue is a very common problem so I set my alerts only to notify me via email when we have crossed the critical threshold, and many metrics require me to stay above the threshold for a specified amount of time. This way I know that if something shows up in my inbox that it is probably important, instead of getting so many messages that I route them to a folder I never check.
I won’t need to address a short CPU spike, but a server that has been maxed out for 30 minutes is potentially worth considering.
When it comes to warning thresholds, I will reference these in my reports and on the dashboards so I have some opportunity to see how often devices are in that zone without getting numbed by a constant stream of emails.
Going farther into the topic of email alerts and thresholds, you will typically start off with simple global thresholds like “Notify me when memory utilization goes above 90%” but people eventually find that these rules are too generic.
It turns out that their database servers always use high percentages of memory, or they don’t care when the dev machines max out, or they have a rarely used utility that only has 1 GB ram but they don’t feel like they use it often enough to upgrade the resources.
As the use of the monitoring tools mature, people begin to find more and more exceptions to these global rules and one-off little edge cases. Instead of carving out all kinds of exclusions from the standard memory alert saying “Don’t email me if the server is a database, and not if it is in dev, and not if today is Tuesday” it is more efficient to just get to a place where you have individual thresholds per device.
In SolarWinds you would do this by changing the trigger conditions to a Double Value Comparison and instead of saying Memory Percent is greater than 90 you set it Memory Percent is greater than Memory Critical Threshold. Now it will check on a per device basis against the thresholds you have set in the node properties.
It would seem like doing it individually would be less scalable but in practice it is a lot easier to have the granular threshold capabilities rather than having several variations on the same alert tweaked for each individual edge case. The duplicate alert scenario eventually ends up with unintentional gaps or duplicate alerts because years from now people have forgotten the edge cases that are already in the system they don’t want to go back and check them all.
You can set these thresholds in bulk from the Manage Nodes screen and there are methods to automatically set them based on custom properties so you don’t have to manage them actively.
I also mentioned reports. I try not to do any email alerts based on predictions of use trends because ultimately a prediction is always a guess. This kind of information makes more sense in a report that you can run periodically, gather all the dangerous looking trends into one place rather than separately investigating each disk volume that looks like it might fill up in 3 weeks.
When building reports and email messages I always try to think about all the additional information I have that might be useful to include in the message. If I get an email indicating that CPU load on a server has been high then it can be useful to include information like what OS does it run, how many cpu cores does it have, what is the threshold on this server, and what application is it associated with.
As a senior admin, you may already have all this information in your head but in a big environment there might be so many servers that no single person knows them all. Including as much context information as you can in the alert will help the people who end up dealing with the alert to remember how things are connected and gets problems resolved faster.
If there are common troubleshooting steps or issues associated with a particular server including a comment about that in the server’s tags helps to get your institutional knowledge documented and available at the times when it is most needed.
So, as you see, there is a lot to keep in mind when setting up a monitoring system that is effective at tracking the health of your environment but a little planning and strategy can dramatically improve the results you get from yours.
Field Systems Engineer