Welcome to the fourth segment in our Observability and Maturity Series: Observability Data and Metrics. In this post, we build on the earlier entries and introduce the concept of Observability Data and Metrics.
Our view at Loop1 is that monitoring is the foundation of observability: IT Operations teams need both to deliver effective, full-stack, end-to-end application delivery and service assurance.
In the L1M3 model, Observability Data and Metrics refers to identifying, instrumenting, gathering, and storing our performance metrics, status metrics, syslogs, application logs, event logs, application traces, SNMP traps, etc.
With Observability Data and Metrics, we focus on collecting the right information at the right level of fidelity, and then storing and retaining that data in ways that protect Network Management System (NMS) performance while still supporting robust data analytics and long-term trend analysis.
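One common way to balance long-term retention against NMS performance is to keep raw samples for a short window and roll them up into summarized buckets for long-term storage. The sketch below is a minimal, vendor-neutral illustration of that idea using hypothetical per-second CPU readings; `rollup` and the sample data are assumptions for illustration, not any particular product's API.

```python
from statistics import mean

def rollup(samples, window):
    """Aggregate raw (timestamp, value) samples into fixed-size windows,
    keeping min/avg/max so long-term trends survive downsampling."""
    buckets = {}
    for ts, value in samples:
        buckets.setdefault(ts - ts % window, []).append(value)
    return {
        start: {"min": min(vals), "avg": mean(vals), "max": max(vals)}
        for start, vals in sorted(buckets.items())
    }

# Hypothetical: one minute of per-second CPU readings, collapsed
# into two 30-second buckets for long-term retention.
raw = [(t, 40 + (t % 7)) for t in range(60)]
print(rollup(raw, 30))
```

Keeping min and max alongside the average matters: an average alone would hide the short spikes that capacity planning and alert tuning both care about.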
Fundamentally, monitoring and observability are siblings: both involve processing metrics and logs. Monitoring predominantly refers to the process of identifying, collecting, analyzing, and alerting against thresholds, while observability generally refers to collecting and analyzing logs to determine the state of the system from that log data. Observability also explicitly adds code profiling via system traces, which extends beyond the realm of traditional monitoring and supports not only application monitoring but also application debugging.
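To make the tracing idea concrete, here is a deliberately minimal, framework-free sketch of what a trace span records: a named unit of work, its parent, and how long it took. Real tracing libraries (OpenTelemetry and similar) do far more, so treat the `span` helper and the `SPANS` list as illustrative assumptions only.

```python
import time
from contextlib import contextmanager

SPANS = []  # finished spans: (name, parent, duration_seconds)

@contextmanager
def span(name, parent=None):
    """Time a named unit of work, preserving the parent/child link
    that lets a trace show where request time was actually spent."""
    start = time.perf_counter()
    try:
        yield name
    finally:
        SPANS.append((name, parent, time.perf_counter() - start))

# A request broken into child spans, the way a trace would show it.
with span("handle_request") as root:
    with span("query_database", parent=root):
        time.sleep(0.01)   # stand-in for real work
    with span("render_response", parent=root):
        time.sleep(0.005)  # stand-in for real work

for name, parent, dur in SPANS:
    print(f"{name} (parent={parent}): {dur * 1000:.1f} ms")
```

Because each child span carries a parent reference, the collected data can answer the debugging question a plain metric cannot: not just "the request was slow," but "the database query inside it was slow."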
Monitoring also leverages SNMP traps and Syslog messages, which are received from devices, similar to how observability might receive log data from applications.
Regarding metrics, we have the concept of ‘polling’ and ‘polling intervals.’ For each metric, we must determine the appropriate frequency with which we want to gather the data from each element (node, interface, volume, application component, operation, etc.).
That frequency of polling, or polling interval, will determine the level of analysis we can perform on that metric, the trends we can identify, and the thresholds we might use in our alerting mechanisms. If we don’t collect it, we can’t display it.
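The effect of the polling interval is easy to demonstrate. The sketch below uses a hypothetical CPU signal that is flat at 20% except for a 15-second spike to 95%: sampled every 10 seconds the spike crosses an 80% alert threshold, but averaged over a 60-second interval it vanishes entirely. `poll_series` and the numbers are assumptions for illustration.

```python
def poll_series(signal, interval):
    """Sample a per-second signal at a given polling interval,
    averaging over each interval (as many collectors do)."""
    return [
        sum(signal[t:t + interval]) / interval
        for t in range(0, len(signal), interval)
    ]

# Hypothetical CPU signal: two minutes at 20%, with a
# 15-second spike to 95% starting at t=60.
signal = [20.0] * 120
for t in range(60, 75):
    signal[t] = 95.0

THRESHOLD = 80.0
for interval in (10, 60):
    samples = poll_series(signal, interval)
    breached = any(s > THRESHOLD for s in samples)
    print(f"{interval}s polling: max={max(samples):.1f}%, alert={breached}")
```

At 10-second polling the peak sample is 95.0% and the threshold fires; at 60-second polling the spike is averaged down to under 40% and never alerts. The interval we choose literally determines which events exist in our data.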
To have high-performance applications, we need good monitoring and observability data. Once that data is gathered, we can use it not only to ensure the performance and availability of our applications and services but also for data analytics. We can leverage tools like Power BI, Tableau, Qlik, or others to assess longer-term trends and perform capacity planning. Better analytics enables better forecasting and budgeting, and with better forecasts and budgets we can plan staffing, and from there even develop training plans and training budgets.
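The capacity-planning step above boils down to fitting a trend to retained metric data and projecting it forward. Here is a minimal least-squares sketch over hypothetical monthly disk-usage figures; `linear_forecast` and the sample numbers are illustrative assumptions, not output from any of the tools named above.

```python
def linear_forecast(values, periods_ahead):
    """Fit a least-squares linear trend through equally spaced samples
    and project it periods_ahead beyond the last sample."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + periods_ahead)

# Hypothetical monthly disk usage (GB) over six months.
usage = [410, 432, 455, 470, 498, 515]
print(f"Projected usage in 6 months: {linear_forecast(usage, 6):.0f} GB")
```

A projection like this is only as good as the retention behind it, which is exactly why the storage and fidelity decisions discussed earlier matter for budgeting conversations months later.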
In summary, the L1M3 assessment and planning concept of Observability Data and Metrics is the methodology that ensures we collect the correct data to support both our real-time monitoring needs and our long-term data analytics needs.
When done right, it enables a more mature, proactive IT operations capability that helps our organization understand the return on technology investments, reduces risk, and improves organizational agility.
In our next blog post in this Observability and Maturity Series, we introduce the Security and Compliance assessment area and discuss why an intrinsic approach to security and compliance is expected in IT Operations today.