Category: Blog

Decrypting Technical Jargon

Technical Jargon Decrypted – Part 1

Technology and its use of jargon can be confusing and frustrating, especially for new users (newbies). Understanding the terminology early on will help lessen these feelings and improve the overall user experience from the start. This series starts at the most basic level to build a foundation, and each post will advance a little further as we progress.

The basics:

Bit – A basic unit of information that can hold only one of two values: off (zero) or on (one).

CPU (Central Processing Unit) – The brains of a computer, responsible for interpreting and executing instructions from hardware and software.

Byte – A unit of information containing 8 bits, roughly the equivalent of one character.

WWW (World Wide Web) – Invented in 1989 by Tim Berners-Lee, the Web is a global space where documents, pictures, movies, applications, and more can be accessed over the internet via URLs.

KB (Kilobyte) – One kilobyte is equal to one thousand bytes (a short example tying bits, bytes, and kilobytes together follows these definitions).

RAM (Random Access Memory) – Stores frequently accessed information for quick retrieval by the CPU. All data stored in RAM is lost if the device loses power.

IP (Internet Protocol) – The protocol that governs how data is addressed and routed across networks; an IP address is the identifier assigned to a device or node so it can communicate on a network.

URL (Uniform Resource Locator) – The address used to identify a resource (document, picture, movie, etc.) on the web; resources are interlinked via hypertext.

HDD (Hard Disk Drive) – A fixed disk that uses magnetic technology to store and retrieve data.

HTTP (HyperText Transfer Protocol) – The foundation of communication on the World Wide Web; logical links (hyperlinks) allow navigation between resources and nodes.

GUI (Graphical User Interface) – A visual interface allowing interaction with the underlying software and hardware.

HTML (HyperText Markup Language) – Used to create webpages and is the building block of the World Wide Web.

CLI (Command Line Interface) – A text-based interface allowing interaction with the underlying system by typing commands.

ISP (Internet Service Provider) – A company that provides access to the internet; for example, AT&T, Spectrum, or Comcast.

ROM (Read Only Memory) – Like RAM, except that its contents are not lost when the device loses power and, as the name implies, normally cannot be modified.

SSL (Secure Sockets Layer) – A standard used to establish secure links between hosts, for example between a web server and a client web browser.
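To see how a few of these fit together, here is a tiny Python sketch (purely illustrative) that converts between bits, bytes, and kilobytes using the one-thousand-bytes-per-kilobyte convention above:

    # Illustrative only: relating bits, bytes, and kilobytes.
    BITS_PER_BYTE = 8      # a byte contains 8 bits
    BYTES_PER_KB = 1000    # one kilobyte is one thousand bytes

    file_size_kb = 250     # e.g., a small document
    file_size_bytes = file_size_kb * BYTES_PER_KB
    file_size_bits = file_size_bytes * BITS_PER_BYTE

    print(f"{file_size_kb} KB = {file_size_bytes:,} bytes = {file_size_bits:,} bits")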

We have only scratched the surface by defining some of the commonplace terminology used in tech speak today. Hundreds of great resources are available for free on the web, and best of all, they are only a mouse click or two away.

Wes Johns
Sales Engineer

IT-Project-Help

Basics of Setting Up a Network Monitoring System – Part 2

Part 2 of 2
(Read part 1 here)

I have all this data coming in, now what?

Dashboard building

Once you have all your important systems in the monitoring tool, logical polling intervals set, and everything categorized and labeled, you probably have a default summary page with a ton of red and green Christmas-tree lights and severe-looking red words all over it.

Your boss walks past the screen and sees red dots and error messages and starts asking why so many things are broken. Now you must explain that this is probably normal stuff and they shouldn’t worry, but what is the point of the monitoring tool if you are supposed to ignore half of what shows up on the screen?

How do you know which half is the important stuff and which half you can ignore? This is where you start to tailor the tool to your environment and make it helpful. Those tags we set up earlier will be critical in this regard.

Often the initial summary page should be a basic snapshot of the current availability of the key services that impact nearly everyone in the organization. How do the domain controllers look? Can we still send emails? Are the main business offices and datacenters okay? Is the company website up?

Depending on which monitoring tools you have, different methods will be available to validate all these things, but generally you want a simple way to display that a given service is available or unavailable, and perhaps an indicator for when things are degraded in some way but not completely offline.
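If it helps to picture that rollup, here is a minimal Python sketch (not tied to any particular monitoring product) that collapses the states of a service's member checks into a single Up/Degraded/Down status:

    # Minimal sketch: roll member check results up into one service status.
    def service_status(member_states):
        """member_states: list of booleans, True = component is up."""
        if all(member_states):
            return "Up"          # everything responding
        if not any(member_states):
            return "Down"        # nothing responding
        return "Degraded"        # partially available, worth a closer look

    print(service_status([True, True, True]))   # Up
    print(service_status([True, False, True]))  # Degraded
    print(service_status([False, False]))       # Down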

Keep things simple and high level; if the issue is directly relevant to someone, they can drill in deeper.

In SolarWinds I will often build out lists of critical services as groups, and then display the statuses of those groups with a simple map made using the Network Atlas tool. You can get really elaborate with customizing icons and such here, but big green, yellow, and red indicators do a perfectly fine job.

Going beyond that high-level service indicator, you might also want to include information about upcoming maintenance windows or changes. A simple custom HTML box with the messages you want to get out would do the job, or a custom table listing every device that is scheduled to be unmanaged this week.

Are there any significantly congested points on the network that might have wide-ranging impacts, such as the WAN interfaces? You can add a filtered resource that just shows the current utilization of these circuits, or if there are many circuits, show only the ones that are above their thresholds.
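As one possible way to feed that kind of filtered resource, the sketch below uses the Orion SDK's Python client (orionsdk) to run a SWQL query for busy interfaces. The server name, credentials, the 80% cutoff, and the entity and field names (Orion.NPM.Interfaces, InPercentUtil, OutPercentUtil) are assumptions to verify against your own environment and version; a similar query can usually drive a custom query widget directly.

    # Sketch only: list interfaces running above an assumed 80% utilization.
    # Entity and field names are assumptions; check them against your SolarWinds version.
    from orionsdk import SwisClient

    swis = SwisClient("orion.example.local", "monitor_user", "secret")  # hypothetical host/credentials

    query = """
    SELECT TOP 20 i.Node.Caption AS Device, i.Name AS Interface,
           i.InPercentUtil, i.OutPercentUtil
    FROM Orion.NPM.Interfaces i
    WHERE i.InPercentUtil > 80 OR i.OutPercentUtil > 80
    ORDER BY i.InPercentUtil DESC
    """

    for row in swis.query(query)["results"]:
        print(row["Device"], row["Interface"], row["InPercentUtil"], row["OutPercentUtil"])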

It is probably useful to have a search box to help people jump to the specific device they are interested in if they logged in with a mission in mind. I would shy away from resources that list every event happening in the environment here, as they turn into constant streams of noise and are likely too scattered to be very helpful without some filtering.

If your environment is larger, you may also want to add tabs to this view, or links to other dashboards, where you split things up based on the support teams or types of monitored objects involved. You will find that the layout that makes sense to one team is often not particularly relevant to another.

While the network team might want the environment grouped by Site names, the DBA team might not be as concerned with physical locations if their workloads all run in the central datacenter. Maybe they would benefit more from sorting their objects by the type of database on the server (Oracle vs. MSSQL) or by environment (Prod vs. Dev).

On these more detailed pages you will likely also want to get into displaying more charts to show how things change over time in the environment.

A Network team could benefit from a chart indicating average response time for network devices grouped by Site name over the last 24 hours.

An Applications team might want to see things like the average CPU and memory use of their servers, but over a rolling week in order to see day-over-day changes in the trends, and alongside that a chart of the application’s active user sessions or data throughput.

Spending time talking to the consumers of this data can give you a lot of insight into what metrics they care about, and what format is most helpful in presenting it to them. Putting a table where you need a chart makes it hard to spot changes over time and charts are unnecessary if all you need is a current status.

Thresholds and Responses

A key element in monitoring is setting your thresholds. How much CPU load is enough to get your admin involved, and at what point does slow response to pings warrant investigation?

You will find out-of-the-box thresholds built into whatever tool you are using, but you will need to tweak them to the reality of your environment. I typically find that the most effective way to use thresholds is to set my critical threshold to the value where I would expect someone to try to address the issue immediately.

If you know that a server normally uses a high amount of memory, then leaving it with the default threshold of 90% is not efficient, since that metric will always show as critical. That makes the dashboards look like there are more problems than there are and gets users into the habit of ignoring the red signs.

Similarly, if you have monitors set up on something like SQL performance counters and your DBA tells you that they do not generally worry about the number of connected sessions, then don’t set a critical value for that metric. If something is just nice to know, or gives clues about what is going on but isn’t a main indicator, then I don’t want to get messages about it in my inbox.

Alert fatigue is a very common problem, so I set my alerts to notify me via email only when we have crossed the critical threshold, and for many metrics the value must stay above the threshold for a specified amount of time. This way I know that if something shows up in my inbox it is probably important, instead of getting so many messages that I route them to a folder I never check.

I won’t need to address a short CPU spike, but a server that has been maxed out for 30 minutes is potentially worth considering.
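Outside of any specific tool, the "stay above the threshold for a while" rule amounts to something like this rough Python sketch (the sample count and the 95% value are just placeholders):

    # Sketch: only alert when every sample in the window exceeds the critical threshold.
    SAMPLES = 6           # e.g., 6 polls at 5-minute intervals, roughly 30 minutes
    CRITICAL_CPU = 95     # percent; placeholder value

    def should_alert(recent_cpu_samples):
        """recent_cpu_samples: recent readings, oldest first."""
        if len(recent_cpu_samples) < SAMPLES:
            return False                      # not enough history yet
        window = recent_cpu_samples[-SAMPLES:]
        return all(value >= CRITICAL_CPU for value in window)

    print(should_alert([40, 97, 98, 99, 97, 96, 98]))  # True: sustained for the whole window
    print(should_alert([40, 45, 99, 50, 42, 41, 44]))  # False: just a short spike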

When it comes to warning thresholds, I will reference these in my reports and on the dashboards so I have some opportunity to see how often devices are in that zone without getting numbed by a constant stream of emails.

Going further into the topic of email alerts and thresholds, you will typically start off with simple global thresholds like “Notify me when memory utilization goes above 90%,” but people eventually find that these rules are too generic.

It turns out that their database servers always use a high percentage of memory, or they don’t care when the dev machines max out, or they have a rarely used utility server with only 1 GB of RAM that they don’t feel is used often enough to justify upgrading.

As the use of the monitoring tools matures, people begin to find more and more exceptions to these global rules and one-off little edge cases. Instead of carving out all kinds of exclusions from the standard memory alert, saying “Don’t email me if the server is a database, and not if it is in dev, and not if today is Tuesday,” it is more efficient to get to a place where you have individual thresholds per device.

In SolarWinds you would do this by changing the trigger conditions to a Double Value Comparison: instead of saying Memory Percent is greater than 90, you set it to Memory Percent is greater than Memory Critical Threshold. Now it will check on a per-device basis against the thresholds you have set in the node properties.
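The difference between the global rule and the per-device comparison looks roughly like this conceptual Python sketch (made-up node data, not SolarWinds' actual evaluation engine):

    # Conceptual sketch: global threshold vs. per-device threshold.
    nodes = [  # hypothetical values pulled from node properties
        {"name": "app01",  "memory_percent": 72, "memory_critical": 90},
        {"name": "sql01",  "memory_percent": 93, "memory_critical": 97},  # high memory is normal here
        {"name": "file01", "memory_percent": 88, "memory_critical": 85},
    ]

    for n in nodes:
        global_rule = n["memory_percent"] > 90                    # one-size-fits-all
        per_device  = n["memory_percent"] > n["memory_critical"]  # the Double Value Comparison idea
        print(n["name"], "global:", global_rule, "per-device:", per_device)

With the one-size-fits-all rule, sql01 alerts even though that is its normal behavior, while file01's genuine problem at 88% is missed; the per-device comparison gets both right.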

It would seem like doing it individually would be less scalable, but in practice it is a lot easier to have granular threshold capabilities than to maintain several variations on the same alert, each tweaked for an individual edge case. The duplicate-alert scenario eventually ends up with unintentional gaps or duplicate alerts, because years from now people will have forgotten which edge cases are already in the system and won’t want to go back and check them all.

You can set these thresholds in bulk from the Manage Nodes screen, and there are methods to automatically set them based on custom properties so you don’t have to manage them one by one.
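If you prefer to script those bulk updates, something along these lines is possible with the Orion SDK's Python client. The "Role" and "MemoryCriticalThreshold" custom properties, the role-to-threshold mapping, and the server details are all hypothetical; the point is only to show the pattern of querying nodes and writing their properties back in one pass.

    # Sketch only: set a per-node memory threshold based on a "Role" custom property.
    # Property names and values are hypothetical; adapt to your own environment.
    from orionsdk import SwisClient

    swis = SwisClient("orion.example.local", "monitor_user", "secret")

    ROLE_THRESHOLDS = {"Database": 97, "WebServer": 90, "Utility": 95}

    rows = swis.query(
        "SELECT Uri, CustomProperties.Role AS Role FROM Orion.Nodes"
    )["results"]

    for row in rows:
        threshold = ROLE_THRESHOLDS.get(row["Role"])
        if threshold is None:
            continue  # leave nodes without a mapped role untouched
        # Assumes a node custom property named MemoryCriticalThreshold exists.
        swis.update(row["Uri"] + "/CustomProperties", MemoryCriticalThreshold=threshold)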

I also mentioned reports. I try not to do any email alerts based on predictions of usage trends, because ultimately a prediction is always a guess. This kind of information makes more sense in a report that you can run periodically, gathering all the dangerous-looking trends into one place rather than separately investigating each disk volume that looks like it might fill up in three weeks.

When building reports and email messages, I always try to think about the additional information I have that might be useful to include in the message. If I get an email indicating that CPU load on a server has been high, it is useful to include details like which OS it runs, how many CPU cores it has, what the threshold is on this server, and which application it is associated with.

As a senior admin, you may already have all this information in your head, but in a big environment there might be so many servers that no single person knows them all. Including as much context as you can in the alert helps the people who end up dealing with it remember how things are connected and gets problems resolved faster.
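As an illustration, with made-up placeholder fields rather than any particular tool's variable syntax, an alert message with that context baked in might read:

    Subject: CPU critical on [Node Name] ([Application])

    [Node Name] has averaged [CPU %]% CPU for the last [Duration] minutes,
    above its critical threshold of [Threshold]%.

    OS: [Operating System]   Cores: [CPU Count]   Site: [Site Name]
    Owning team: [Team]      Notes: [Troubleshooting comments from the node's tags]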

If there are common troubleshooting steps or issues associated with a particular server, including a comment about that in the server’s tags helps get your institutional knowledge documented and available at the times when it is most needed.

So, as you can see, there is a lot to keep in mind when setting up a monitoring system that is effective at tracking the health of your environment, but a little planning and strategy can dramatically improve the results you get from yours.

Marc Netterfield
Field Systems Engineer

IT-Project-Help

Basics of Setting Up a Network Monitoring System – Part 1

Part 1 of 2
(read part 2 here)

While working with my clients to help them set up monitoring systems I often see the same trouble spots come up. It is surprisingly easy to set up any monitoring system in a way that causes a flood of red and green blinking lights and hundreds of alerts each day that provide more heat than light when it comes to keeping an eye on your infrastructure.

You can improve the effectiveness of the monitoring system by employing a few techniques. As we primarily work with the SolarWinds suite of products, I will use that as a reference point, but these concepts apply universally.

What to monitor?

If your resources for the monitoring system are unlimited then the easy answer to this is – everything.  

In practice, very few organizations commit unlimited resources to monitoring. You find yourself limited in terms of server resources for the monitoring system itself, or network capacity to the polled objects, or sometimes just based on software license limitations.

It might be nice to have down-to-the-minute data points going back two years for everything from the firewall to the workstations, but it can take a lot of CPU and IOPS to query all those data points and load them onto a web page in a reasonable amount of time. Whatever your limitations are, you will need to consider them when setting up your system.

Bring in the highest-impact infrastructure first, and then progressively expand coverage to the less critical systems. On the network side this is typically firewalls and/or routers, core and distribution switches, and then access switches. You might start out by monitoring just the uplinks between network devices rather than monitoring and alerting on every single port.

You will also want to think about whether you get any useful information from monitoring some types of virtual interfaces, such as loopbacks, nulls, or routed subinterfaces. If I have the option, I like to set up monitors for hardware like UPS systems and CRAC units because I like to bring all the data into the central pane of glass, but in a squeeze these types of appliances typically have adequate alerting capabilities built in and might need to be skipped until more resources become available.

On the application monitoring side of things, I try to start out with at least checking that my important processes and services are running, and then expand into the less clear-cut things like performance counters and synthetic transactions. Depending on the importance of a system, I might just do a basic check that my website loads rather than pulling a dozen performance metrics about how many active connections the web server has and how many bytes of pages are being transmitted.
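For that "does my website even load" level of check, a sketch like the following is often all it takes (the URL and the expected page text are hypothetical, and in practice the monitoring tool runs this kind of check for you on a schedule):

    # Sketch: the simplest useful web check - does the page load, and does it look right?
    import requests

    URL = "https://intranet.example.com/login"   # hypothetical endpoint

    try:
        response = requests.get(URL, timeout=10)
        # Status 200 plus an expected string is a basic sanity check, not a full synthetic transaction.
        healthy = response.status_code == 200 and "Sign in" in response.text
    except requests.RequestException:
        healthy = False   # timeouts, DNS failures, refused connections, etc.

    print("UP" if healthy else "DOWN", URL)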

You will always want to check yourself by asking, “What am I going to be able to do with the information that I get from monitoring this?” I’ve seen people monitoring IP addresses that nobody has any idea what they are for; they picked them up as part of a scan and never got any further than that.

How Frequently to Monitor?

Look at your business needs, the resources of your monitoring tool, and your workflow when it comes to reacting to alerts to figure out how frequently you need to poll.

Checking for up/down status every 30 seconds might sound like a good idea, but if you know that your team is so stretched that they rarely get around to looking at incoming messages, then building that tight interval into the monitoring system is just going to inflate your hardware requirements without providing a measurable difference in resolution times.

If your polling system is having a hard time keeping up, then making small increases in the time between polls can help. On the other hand, if you have important WAN interfaces that are getting hammered and you need to focus on them, then perhaps you could increase the frequency of the monitoring on just those key interfaces.

You just need to keep track of where you have adjusted your standard intervals, and why, so that you have granularity where it helps and not where it is excessive.

When thinking about the workflow, is anyone in your environment going to jump up at night if CPU loads go high for a minute, or would loads need to stay high for a sustained period before anyone would be bothered to check it out?

If you know your team wouldn’t investigate a load issue that didn’t last 30 minutes, then a 5-10 minute polling interval for CPU is probably just fine. If the app team complains that their server has erratic spikes in load that aren’t showing up in the monitoring, and they are having a hard time correlating that with the other performance counters you are looking at on the server, then you may have a case for a tighter interval, at least temporarily while the issue is investigated.

How to organize the monitored objects?

If your environment is going to be larger than a small office, then it is important to think through how you want to organize the objects you are monitoring.

In any good monitoring software you will find some method in place to categorize things, and in SolarWinds that is done with Custom Properties. These are essentially tags that can be applied to a monitored object and referenced elsewhere, such as when you are building alerts, reports, or dashboards.

You need to think about how your environment is laid out and how to translate that into your organizational scheme. Are there separate Production and Dev environments? Are IT assets associated with specific teams or departments? Are there existing SLAs associated with different types of objects that would dictate your response to an alert?

If you have external support or maintenance contracts, what is the contact information of the vendor who supports that system? What applications are associated with a server? What job does that server do within the application?

If you already have any kind of existing CMDB, there is a good chance that the information you need already exists there. If you match up the tags in the monitoring tool to the fields being used in the CMDB, you can import the information into the monitoring tool and ensure consistency across the tools your organization uses.
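As a rough sketch of that matching step, assuming your CMDB can export a CSV with hostname, application, owner, and environment columns (the filename and column names here are hypothetical), you could build a hostname-to-tags mapping in a few lines and then push it into the monitoring tool through its API or bulk-edit screens:

    # Sketch: turn a hypothetical CMDB export into a hostname -> tags mapping.
    import csv

    tags_by_host = {}
    with open("cmdb_export.csv", newline="") as f:        # hypothetical export file
        for row in csv.DictReader(f):
            tags_by_host[row["hostname"].lower()] = {
                "Application": row["application"],
                "OwnerTeam":   row["owner_team"],
                "Environment": row["environment"],        # e.g., Prod vs. Dev
            }

    # From here, each entry would be written to the monitoring tool as custom properties
    # (via its API or an import feature) so both systems stay consistent.
    print(len(tags_by_host), "hosts mapped from the CMDB")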

Populating this kind of information directly in the monitoring tool allows you to pull it up alongside the monitored metrics and streamline the response in case of an issue.

If you are doing an integration from the monitoring tool to your ticketing system, then you would need to tag the monitored objects with all the necessary information to direct tickets to the appropriate queues.

In part 2 I will be discussing how to use thresholds and build dashboards that make it easy to sift through all the data that your monitoring tool is bringing in.

Marc Netterfield
Field Systems Engineer

IT-Project-Help

A Workflow for How To Write A Service Outage Notification

(Figure: outage notification workflow diagram)

Although it’s certainly the goal of every company, and of IT professionals specifically, to avoid service interruptions to their customers, they are inevitable given enough time.

IT professionals do not spend much time sending communications directly to customers. Even planned updates, upgrades, or service changes are typically shared with the customer service team, who will carefully craft a message.

But what about unplanned service interruptions that happen at 4 a.m. and require immediate action? These are the times when proper and effective communication with the customer is crucial.

If you do not already have a template in place, you can use the following guidelines to craft one today.

Essential Structure Of A Notification Email:

Send Immediately – If your customers have not already realized the outage or disruption, they will soon. The faster you’re able to notify your clients, the more on top of the issue you will appear. This will give them the confidence that you are in control and doing everything you can to restore services.

Quality over Quantity – Get to the point. Try to be more like a stop sign and less like a singing telegram. Depending on how disrupted your customers are, they may not have the time to read through non-essential details. In order to effectively communicate your message, it’s best to provide the most important information in as few words as possible.

Honest Explanations – The fact that your customer’s service is out is all they care about. Therefore, it serves no purpose to give excuses or point fingers. Be honest about what the issue is and then go back to working on a resolution.

No Need For Apologies – You may genuinely feel bad for the customers who are affected by the system outage, but telling them how sorry you are will do nothing to resolve the issue or make them feel any better.

Be Serious Not Friendly – It’s completely understandable that you would want to use kindness to try to make the pill easier to swallow. However, no matter how nice you are, they will still be without some service that is necessary or critical to their business. It’s a serious matter so you should have a serious tone.

Notification Email Examples

The information and layout you choose for your system outage notification will vary based on your unique business needs, customer type, industry, and other factors. However, there is a general outline that most notifications follow.

Generic Notice from XYZ Company

From:

Subject: Unplanned service outage with [affected service] – OR – Issue with [affected service]

Opening paragraph should include:

  • Names of services interrupted or affected
  • Approximate time the outage began (or when problem was identified)
  • Day and date of the outage
  • Describe the ways end users are affected. If your customer base is diverse, be specific about which subgroups, which platforms, and what areas of the services are affected, and whether the service is “unavailable” or just experiencing delays. Be sure to describe the issues a customer would be experiencing as a result of the outage.

Example: One of our data centers has been experiencing problems since approximately 6:00 a.m. on Wednesday, Feb 10. Users on shared server plans may be unable to access their server(s) during this time.

Closing paragraph should include:

  • Explain what your company is doing to resolve the issue. This should be brief and direct.
  • Provide a way for your customers to monitor updates or set expectations for how you will communicate future updates.

Example: Our engineers are working to resolve the issue. Once the issue has been resolved, we will email all users.

And that’s it. Between those two paragraphs, you will convey all of the information your customers need to understand that you acknowledge there is an issue and that someone is working to resolve that issue.
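Assembled, a complete notice built from the pieces above might look like this (illustrative only; substitute your own service names and update channels):

    From: XYZ Company Support
    Subject: Unplanned service outage – shared server plans

    One of our data centers has been experiencing problems since approximately
    6:00 a.m. on Wednesday, Feb 10. Users on shared server plans may be unable
    to access their server(s) during this time.

    Our engineers are working to resolve the issue. We will email all users as
    soon as it has been resolved, and will post further updates to our status
    page in the meantime.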

Good luck!
