
Basics of Setting Up a Network Monitoring System – Part 1

Part 1 of 2

While working with my clients to help them set up monitoring systems, I often see the same trouble spots come up. It is surprisingly easy to set up any monitoring system in a way that causes a flood of red and green blinking lights and hundreds of alerts each day, which provide more heat than light when it comes to keeping an eye on your infrastructure.

You can improve the effectiveness of the monitoring system by employing a few techniques. As we primarily work with the SolarWinds suite of products, I will use that as a reference point, but these concepts apply universally.

What to monitor?

If your resources for the monitoring system are unlimited, then the easy answer is: everything.

In practice, very few organizations commit unlimited resources to monitoring. You find yourself limited in terms of server resources for the monitoring system itself, or network capacity to the polled objects, or sometimes just based on software license limitations.

It might be nice to have down-to-the-minute data points going back 2 years for everything from the firewall to the workstations, but it can take a lot of CPU and IOPS to query all those data points and load them onto a web page in a reasonable amount of time. Whatever your limitations are, you will need to consider them when setting up your system.
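To put a rough number on that, here is a small back-of-the-envelope sketch. The device and metric counts are made-up figures for illustration, not from any real environment, and a year is treated as 365 days.

```python
MINUTES_PER_DAY = 24 * 60

def data_points(metrics: int, interval_minutes: int, retention_days: int) -> int:
    """Samples stored for `metrics` series polled every `interval_minutes`."""
    polls_per_day = MINUTES_PER_DAY // interval_minutes
    return metrics * polls_per_day * retention_days

TWO_YEARS = 2 * 365  # days; ignores leap years

# One metric, polled every minute, kept for two years.
per_metric = data_points(1, 1, TWO_YEARS)

# Hypothetical estate: 500 devices with 20 metrics each.
every_minute = data_points(500 * 20, 1, TWO_YEARS)
every_five = data_points(500 * 20, 5, TWO_YEARS)

print(f"{per_metric:,} samples per metric")
print(f"{every_minute:,} samples at 1-minute polling")
print(f"{every_five:,} samples at 5-minute polling")
```

A single metric at one-minute resolution is already over a million samples in two years, which is why the polling interval and retention period are the first things to look at when the database or the web console starts to struggle.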

Bring in the highest-impact infrastructure first, and then progressively expand coverage to the less critical systems. On the network side this is typically firewalls and/or routers, core and distribution switches, and then access switches. You might start out with just monitoring the uplinks between network devices rather than monitoring and alerting on every single port.

You will also want to think about whether you get any useful information from monitoring some types of virtual interfaces, such as loopbacks, nulls, or routed subinterfaces. If I have the option, I like to set up monitors for hardware like UPS systems and CRAC units because I like to bring all the data into one central pane of glass. In a squeeze, though, these types of appliances typically have adequate alerting capabilities built in and might need to be skipped until more resources become available.

On the application monitoring side of things, I try to start out with at least checking that my important processes and services are running, and then expand into less clear-cut things like performance counters and synthetic transactions. Depending on the importance of a system, I might just do a basic check of whether my website loads rather than pulling a dozen performance metrics about how many active connections the web server has and how many bytes of pages are being transmitted.

You will always want to check yourself by asking “What am I going to be able to do with the information that I get from monitoring this?” I’ve seen people monitoring IP addresses that nobody has any idea what they are for; the addresses were picked up as part of a scan and nobody ever got any further than that.

How Frequently to Monitor?

Look at your business needs, the resources of your monitoring tool, and your workflow when it comes to reacting to alerts to figure out how frequently you need to poll.

Checking for up/down status every 30 seconds might sound like a good idea but if you know that your team is so stretched that they rarely get around to looking at incoming messages then building that tight interval into the monitoring system is just going to inflate your hardware requirements without providing a measurable difference in the resolution times.

If your polling system is having a hard time keeping up, then making small increases in the time between polls can help. On the other hand, if you have important WAN interfaces that are getting hammered and you need to focus on that then perhaps you could increase the frequency of the monitoring on just those key interfaces.

You just need to keep track of wherever you have adjusted your standard intervals and why so that you have granularity where it helps and not where it is excessive.
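One lightweight way to keep that record is a small table of overrides that carries the reason next to the interval. The sketch below is only an illustration: the object names, intervals, and the 300-second default are all assumptions, and in practice this could just as easily live in a spreadsheet or in the monitoring tool's own notes field.

```python
from dataclasses import dataclass

DEFAULT_INTERVAL_SECONDS = 300  # assumed standard polling interval

@dataclass(frozen=True)
class IntervalOverride:
    target: str            # hypothetical monitored object name
    interval_seconds: int  # interval used instead of the default
    reason: str            # why this object departs from the standard

OVERRIDES = [
    IntervalOverride("wan-edge-01:Gi0/0", 60, "Saturated WAN link under investigation"),
    IntervalOverride("lab-switch-07", 900, "Low priority, poller was falling behind"),
]

def interval_for(target: str) -> int:
    """Return the override for `target`, or the standard interval."""
    for override in OVERRIDES:
        if override.target == target:
            return override.interval_seconds
    return DEFAULT_INTERVAL_SECONDS

for override in OVERRIDES:
    print(f"{override.target}: {override.interval_seconds}s ({override.reason})")
```

Anything not listed falls back to the standard interval, so the list itself doubles as the answer to "where did we deviate, and why?"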

When thinking about the workflow, is anyone in your environment going to jump up at night if CPU loads go high for a minute or would loads need to stay high for a sustained period before anyone would be bothered to check it out?

If you know your team wouldn’t investigate a load issue that didn’t last 30 minutes, then a 5-10 minute polling interval for CPU is probably just fine. If the app team complains that their server has erratic spikes in load that aren’t showing up in the monitoring and they are having a hard time correlating that to the other performance counters you are looking at on the server then you may have a case for a tighter interval, at least temporarily while the issue is investigated.
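The relationship between the polling interval and a "sustained" condition can be sketched in a few lines. This is a simplified illustration, not how any particular product evaluates alerts; the 90% threshold and the sample values are invented.

```python
import math

def polls_needed(sustain_minutes: int, interval_minutes: int) -> int:
    """Consecutive polls that must breach before the condition counts as sustained."""
    return math.ceil(sustain_minutes / interval_minutes)

def sustained_breach(samples, threshold, consecutive):
    """True if `consecutive` samples in a row exceed `threshold`."""
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= consecutive:
            return True
    return False

# A 30-minute sustain window at a 10-minute polling interval.
required = polls_needed(30, 10)

brief_spike = [95, 40, 92, 96, 50]   # CPU %, never high three polls running
long_load = [95, 40, 92, 96, 91]     # CPU %, high for the last three polls

print(required)
print(sustained_breach(brief_spike, 90, required))
print(sustained_breach(long_load, 90, required))
```

With a 30-minute window, a 10-minute interval still gives three polls to confirm the condition, so polling every minute would add load without changing when anyone gets paged.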

How to organize the monitored objects?

If your environment is going to be larger than a small office, then it is important for you to think through how you want to organize the objects you are monitoring.

In any good monitoring software, you will find some method in place to categorize things, and with SolarWinds that is done with Custom Properties. These are essentially tags that can be applied to a monitored object and referenced elsewhere, such as when you are building alerts, reports, or dashboards.

You need to think about how your environment is laid out and how to translate that into your organizational scheme. Are there separate Production and Dev environments? Are IT assets associated with specific teams or departments? Are there existing SLAs associated with different types of objects that would dictate your response to an alert?

If you have external support or maintenance contracts, what is the contact information of the vendor who supports that system? What applications are associated with a server? What job does that server do within the application?

If you already have any kind of existing CMDB, there is a good chance that the information you need already exists there. If you match up the tags in the monitoring tool to the fields being used in the CMDB, you can import the information into the monitoring tool and ensure consistency across the tools your organization uses.

Populating all this kind of information directly in the monitoring tool allows you to pull that information up alongside the monitored metrics and streamline the response in case of an issue.

If you are doing an integration from the monitoring tool to your ticketing system, then you would need to tag the monitored objects with all the necessary information to direct tickets to the appropriate queues.
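As a rough illustration of how tags can drive ticket routing, here is a minimal sketch. The tag keys, queue names, and routing table are all hypothetical; they are not SolarWinds field names or part of any ticketing system's API.

```python
# Hypothetical routing table: (environment tag, team tag) -> ticket queue.
ROUTES = {
    ("Production", "Network"): "network-oncall",
    ("Production", "Database"): "dba-oncall",
    ("Dev", "Network"): "network-backlog",
}
DEFAULT_QUEUE = "it-triage"  # where untagged or unmatched objects land

def ticket_queue(tags: dict) -> str:
    """Pick a queue from an object's tags, falling back to triage."""
    key = (tags.get("environment"), tags.get("team"))
    return ROUTES.get(key, DEFAULT_QUEUE)

core_switch = {"environment": "Production", "team": "Network", "sla": "4h"}
untagged = {"environment": "Production"}

print(ticket_queue(core_switch))
print(ticket_queue(untagged))
```

The fallback queue is the part worth noticing: any object that was added without its tags ends up in general triage, which is a quick way to spot gaps in your tagging.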

In part 2 I will be discussing how to use thresholds and build dashboards that make it easy to sift through all the data that your monitoring tool is bringing in.

Marc Netterfield
Field Systems Engineer


A Workflow for How To Write A Service Outage Notification

[Figure: Outage notification diagram]

Although it’s certainly the goal of every company, and of IT professionals specifically, to avoid service interruptions for their customers, interruptions are inevitable given enough time.

IT professionals do not spend much time sending communications directly to customers. Even planned updates, upgrades, or service changes are typically shared with the customer service team, who will carefully craft a message.

But what about unplanned service interruptions that happen at 4am and require immediate action? These are the times when proper and effective communication to the customer is crucial.

If you do not already have a template in place, you can use the following guidelines to craft one today.

Essential Structure Of A Notification Email:

Send Immediately – If your customers have not already realized the outage or disruption, they will soon. The faster you’re able to notify your clients, the more on top of the issue you will appear. This will give them the confidence that you are in control and doing everything you can to restore services.

Quality over Quantity – Get to the point. Try to be more like a stop sign and less like a singing telegram. Depending on how disrupted your customers are, they may not have the time to read through non-essential details. In order to effectively communicate your message, it’s best to provide the most important information in as few words as possible.

Honest Explanations – The fact that your customer’s service is out is all they care about. Therefore, it serves no purpose to give excuses or point fingers. Be honest about what the issue is and then go back to working on a resolution.

No Need For Apologies – You may genuinely feel bad for the customers who are affected by the system outage, but telling them how sorry you are will do nothing to resolve the issue or make them feel any better.

Be Serious Not Friendly – It’s completely understandable that you would want to use kindness to try to make the pill easier to swallow. However, no matter how nice you are, they will still be without some service that is necessary or critical to their business. It’s a serious matter so you should have a serious tone.

Notification Email Examples

The information and layout you choose for your system outage notification will vary based on your unique business needs, customer type, industry, and other factors. However, there is a general outline that most notifications follow.

Generic Notice from XYZ Company

From: [Your company name]

Subject: Unplanned service outage – [KEY SERVICE NAMES] OR Issue with [KEY SERVICE NAMES]

Opening paragraph should include:

  • Names of services interrupted or affected
  • Approximate time the outage began (or when problem was identified)
  • Day and date of the outage
  • How end users are affected. If your customer base is diverse, be specific about which subgroups, platforms, and areas of the service are affected, and whether the service is “unavailable” or just experiencing delays. Be sure to describe the issues a customer would be experiencing as a result of the outage.

Example: One of our data centers has been experiencing problems since approximately 6:00 a.m. on Wednesday, Feb 10. Users on shared server plans may be unable to access their server(s) during this time.

Closing paragraph should include:

  • Explain what your company is doing to resolve the issue. This should be brief and direct.
  • Provide a way for your customers to monitor updates or set expectations for how you will communicate future updates.

Example: Our engineers are working to resolve the issue. Once the issue has been resolved, we will email all users.

And that’s it. Between those two paragraphs, you will convey all of the information your customers need to understand that you acknowledge there is an issue and that someone is working to resolve that issue.
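If you want the template ready to fill in at 4am, the outline above can be captured in a few lines of code. This is only a sketch: the function name, parameters, and the service name are assumptions made for illustration, and the wording of the two paragraphs is still yours to write.

```python
def build_notice(company, services, opening, closing):
    """Assemble a plain-text outage notice from the two-paragraph outline."""
    names = ", ".join(services)
    lines = [
        f"From: {company}",
        f"Subject: Unplanned service outage – {names}",
        "",
        opening.strip(),
        "",
        closing.strip(),
    ]
    return "\n".join(lines)

notice = build_notice(
    company="XYZ Company",
    services=["Shared server plans"],  # hypothetical service name
    opening=(
        "One of our data centers has been experiencing problems since "
        "approximately 6:00 a.m. on Wednesday, Feb 10. Users on shared server "
        "plans may be unable to access their server(s) during this time."
    ),
    closing=(
        "Our engineers are working to resolve the issue. Once the issue has "
        "been resolved, we will email all users."
    ),
)
print(notice)
```

Keeping the structure fixed and only the two paragraphs variable makes it harder to slip into excuses or apologies when you are writing under pressure.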

Good luck!