Author(s): Amreth Chandrasehar
Organizations depend on Observability data to reliably operate systems and develop applications. The amount of data generated by these systems grows exponentially with business demand: new and existing customers, more products and features being released, and developers and operators requiring ever more data to analyze and debug applications. As the data grows, complexity, mean time to detect (MTTD), and mean time to resolve (MTTR) grow as well. The Machine Learning (ML) and Artificial Intelligence (AI) techniques discussed in this paper improve MTTD and MTTR and reduce the complexity of debugging issues using Observability data collected from distributed systems.
Introduction
Observability is the level of visibility that a system grants to an outside observer. It is a property of a system, just like usability, availability, and scalability. Monitoring is for operating software and systems; instrumentation is for writing software; Observability is for understanding systems. Investing in Observability means being prepared to spend time instrumenting systems and to cope with the unknowns that arise in production. It can start very simply, with basic health checks. Metrics, tracing, logging, correlations, structured logging, and events, combined together, make a truly powerful solution. When things go wrong, Observability helps teams react and recover faster.
Organizations usually maintain separate data stores for logs, metrics, traces, and events, using multiple different tools. Even federating these tools can be challenging, sometimes resulting in one massive monolithic cluster holding all of this data. This makes it difficult to correlate data across sources with different formats, and difficult for Data Scientists to assemble model data for ML development. This paper discusses how to solve these challenges, improve visibility, and build successful AI and ML models using Observability data.
A machine learning model is a statistical model that is trained on data to make predictions or decisions. Machine learning models are used in a wide variety of applications, such as predicting customer behavior, fraud detection, medical diagnosis, and self-driving cars. There are two main types of machine learning models: supervised learning and unsupervised learning.
Machine learning models can be used to analyze observability data to improve the reliability, performance, and security of systems. Use cases include anomaly detection, outlier detection, forecasting, and root cause analysis, each of which is discussed later in this paper.
Developing machine learning models using observability data can be challenging. Challenges include heterogeneous data formats across tools, the effort required to clean and prepare the data, the need for frequent retraining as system behavior changes, and the cost of running models at scale.
Despite these challenges, machine learning can be a powerful tool for improving the reliability, performance, and security of systems. By carefully considering the challenges and selecting the right machine learning model, organizations can use machine learning to improve their observability and make their systems more resilient.
Machine Learning models, when developed well, can transform organizations by improving reliability, security, performance efficiency, and operational excellence, and by reducing cost. The diagram below, from the AWS Well-Architected Framework, combines Observability, ML development, and alignment with organizational goals.
Observability's four pillars (logs, metrics, traces, and events) are collected by agents and sent to a streaming layer. The data is then ingested into a unified Observability platform for processing.
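To make this flow concrete, the sketch below shows an agent publishing a single metric event to a streaming layer. Kafka, the kafka-python client, the broker address, and the topic name are illustrative assumptions, not specific recommendations; any streaming system plays the same role.

```python
# A minimal sketch of an agent shipping one observability event to a
# streaming layer. Kafka and the topic name are assumptions for illustration.
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed local broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {
    "timestamp": time.time(),
    "type": "metric",                    # one of: log, metric, trace, event
    "name": "cpu.utilization",
    "value": 0.72,
    "labels": {"service": "checkout", "host": "web-01"},
}

# The unified Observability platform consumes this topic for processing.
producer.send("observability-events", value=event)
producer.flush()
```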
Observability standards are very important for standardizing data creation across all services and platforms. They help developers and operators of applications easily correlate data and maintain traceability from the UI down to the DB layer. Enforcing these standards, and auditing them frequently, is one of the most important steps an organization can take. Standards also make it easier for Data Scientists to clean the data and for ML engineers to develop ML models.
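As an illustration, a standardized record might look like the hypothetical schema below, where every service emits the same fields and carries a shared trace ID so logs can later be joined with metrics and traces. The field names are illustrative, not a prescribed standard.

```python
# A minimal sketch of a standardized, structured log record. The schema is
# hypothetical; the point is that every service emits the same fields and a
# shared trace_id so logs, metrics, and traces can be joined.
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("checkout")

def emit_structured(level, message, trace_id, **labels):
    """Emit a JSON log line with the organization-wide standard fields."""
    record = {
        "message": message,
        "trace_id": trace_id,   # same ID propagated from UI to DB layer
        "service": "checkout",
        "env": "prod",
        **labels,
    }
    log.log(level, json.dumps(record))

emit_structured(logging.ERROR, "payment timed out",
                trace_id="4bf92f3577b34da6", upstream="payments-db")
```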
Observability is commonly described in terms of three main pillars: logging, metrics, and traces, with events often counted as a fourth.
Correlations between observability data types can be used to gain a deeper understanding of the system. For example, by correlating metrics and logs, ML models can identify the events that led to a spike in CPU usage. Similarly, by correlating metrics and traces, systems can identify the specific components that are interacting with one another.
By understanding the correlations between observability data types, teams gain a deeper understanding of the system and can identify potential problems before they cause outages or performance degradation.
Below are some tips for identifying correlations between observability data types: use consistent timestamps and shared identifiers (such as trace IDs) across data sources, adopt structured logging, standardize labels and metadata across services, and ingest the data into a unified platform so that different data types can be joined.
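As a minimal sketch of the last tip, the example below joins a CPU metric series with error-log counts on a shared timestamp index and measures their correlation; synthetic data stands in for data pulled from a unified platform.

```python
# A minimal sketch of correlating two observability data types: CPU samples
# and error-log counts on a shared per-minute timestamp index.
import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=120, freq="min")
rng = np.random.default_rng(0)
spike = np.where(np.arange(120) > 90, 40.0, 0.0)   # simulated incident window

cpu = pd.Series(50 + rng.normal(0, 2, 120) + spike, index=idx, name="cpu")
errors = pd.Series(rng.poisson(1 + spike / 10), index=idx, name="errors")

df = pd.concat([cpu, errors], axis=1)
# A high correlation suggests the logged errors and the CPU spike are related
# and worth investigating together.
print(df["cpu"].corr(df["errors"]))
```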
Applying machine learning models to Observability (monitoring, telemetry) data provides actionable insights that can be leveraged to proactively surface threats, identify anomalies, spot behavioral trends, and accelerate problem resolution.
Anomaly detection addresses one of the core challenges in monitoring dynamic, responsive, ever-scaling infrastructure: how to define normal versus abnormal performance. Setting static thresholds often leads to false alarms due to normal variations in key metrics like website traffic and customer checkouts, which tend to rise and fall depending on the time of day, day of the week, or day of the month. Anomaly detection accounts for those expected variations, as well as long-term trends, to intelligently flag behavior that is truly unexpected. Datadog's anomaly detection algorithms are rooted in established statistical models but have been heavily adapted for the domain of high-scale infrastructure and application monitoring.
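Datadog does not publish its exact algorithms, so the sketch below shows only the general idea of seasonality-aware anomaly detection: baseline each point against the typical value for the same hour of day, using robust statistics so the anomaly itself does not distort the baseline. The data and thresholds are illustrative.

```python
# A minimal sketch of seasonality-aware anomaly detection (not Datadog's
# algorithm): flag points far outside the typical band for that hour of day.
import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=28 * 24, freq="h")
rng = np.random.default_rng(1)
hours = idx.hour.to_numpy()
daily = 100 + 30 * np.sin(2 * np.pi * hours / 24)   # expected daily cycle
traffic = pd.Series(daily + rng.normal(0, 5, len(idx)), index=idx)
traffic.iloc[100] += 80                             # injected anomaly

# Baseline each point against the same hour of day; median and MAD are
# robust, so the anomaly barely shifts its own baseline.
by_hour = traffic.groupby(hours)
baseline = by_hour.transform("median")
mad = by_hour.transform(lambda s: (s - s.median()).abs().median())

anomalies = traffic[(traffic - baseline).abs() > 4 * 1.4826 * mad]
print(anomalies)   # the injected spike stands out; daily swings do not
```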
Monitoring large fleets of servers, containers, IoT devices, or application instances makes it difficult to keep tabs on the health and performance of any individual member of the fleet. Datadog's outlier detection algorithms constantly evaluate large fleets or groups to identify if any member of the fleet starts behaving abnormally, as compared to its peers. Using outlier detection, engineers can automatically identify unhealthy application servers, databases, or other systems in need of maintenance, without having to define ahead of time what normal, healthy behavior looks like.
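The sketch below illustrates the general idea of peer-based outlier detection (again, not Datadog's actual algorithm): compare each host against the fleet median at every timestamp and flag hosts that persistently deviate from their peers, with no predefined notion of healthy behavior.

```python
# A minimal sketch of peer-based outlier detection across a fleet: flag any
# host that persistently deviates from the fleet median.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
hosts = [f"web-{i:02d}" for i in range(20)]
latency = pd.DataFrame(rng.normal(100, 5, (60, 20)), columns=hosts)
latency["web-07"] += 40                      # one host degrades vs its peers

fleet_median = latency.median(axis=1)
mad = latency.sub(fleet_median, axis=0).abs().median(axis=1)
deviation = latency.sub(fleet_median, axis=0).div(1.4826 * mad, axis=0)

# A host is an outlier if it sits far from the fleet most of the time.
outliers = (deviation.abs() > 3).mean() > 0.5
print(outliers[outliers].index.tolist())     # ['web-07']
```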
Even in dynamic systems, some limits are fixed, and breaching them can have severe consequences. When an application runs out of memory, or a database server runs out of disk space, the resulting crash can trigger a cascading failure and cause a user-facing outage. For resource constraints such as these, forecasting algorithms can be used to alert engineering teams with sufficient time to address the problem and avoid issues altogether. For instance, forecast alerts can notify teams a week before disk space is predicted to run out, based on recent trends and seasonal patterns in that system's disk usage. With Webhooks integration and monitoring APIs, teams can build automated AIOps (artificial intelligence for IT operations) workflows, such as archiving or deleting logs to reclaim disk space or provisioning more instances of an application to reduce the memory pressure on app servers.
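A minimal sketch of a forecast alert follows. It fits only a linear trend to recent disk usage, whereas a production system would also model the seasonal patterns mentioned above; the capacity figure and the one-week warning window are illustrative assumptions.

```python
# A minimal sketch of a forecast alert: fit a linear trend to recent disk
# usage and estimate when the disk fills, alerting inside the warning window.
import numpy as np

days = np.arange(30)                          # last 30 days of observations
used_gb = 500 + 12.0 * days + np.random.default_rng(3).normal(0, 5, 30)
capacity_gb = 900.0                           # illustrative disk capacity

slope, intercept = np.polyfit(days, used_gb, 1)
days_to_full = (capacity_gb - (slope * days[-1] + intercept)) / slope

if days_to_full < 7:
    # In practice this would call a webhook to open an incident or trigger an
    # automated AIOps workflow (archive logs, provision more capacity, ...).
    print(f"ALERT: disk predicted to fill in {days_to_full:.1f} days")
```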
When a system goes down or is affected by an unexpected issue, it can take hours to find the root cause of the problem and fix it, leading to prolonged service interruption and possibly loss of revenue. Watchdog RCA automatically detects anomalous behavior from across your applications and infrastructure, identifies the causal relationships among different symptoms, and clearly pinpoints the root cause. This approach enables you to resolve problems anywhere in your stack faster than ever, significantly reducing your mean time to resolution.
The end-to-end Machine Learning pipeline on Observability data comprises data collection, data transformation, data storage, model inferencing, and action and resolution components. The diagram below shows the flow from data generation to an action taken to resolve an issue.
Various machine learning models can be applied to observability data for predictions, such as XGBoost (Extreme Gradient Boosting), Recurrent Neural Networks (RNN), Auto-Regressive Integrated Moving Average (ARIMA), Support Vector Regression (SVR), Multilayer Perceptron (MLP), and Long Short-Term Memory Recurrent Neural Networks (LSTM RNN). In the next section, a SARIMA model is used to forecast database CPU utilization from time-series metrics.
Inefficient allocation of CPU resources and lack of proactive monitoring can lead to service failures and suboptimal resource utilization in RDS instances across different regions. There is a need for an advanced Time Series model that can accurately forecast CPU utilization, provide real-time insights, and enable proactive resource allocation to prevent service disruptions and optimize performance.
Implementing this CPU Utilization Forecasting model will result in optimized resource allocation, reduced service failures, and improved performance. The real-time insights provided by the model will allow teams to monitor CPU utilization and identify instances that are either overutilized or underutilized for longer durations. This information will enable them to optimize resource allocation, maintaining high performance levels while ensuring a buffer for sudden increases in CPU utilization. Ultimately, this proactive approach will prevent service failures, enhance resource optimization, and improve overall service availability and performance.
The model will consist of three pipelines: a data collection pipeline, a static threshold pipeline, and a forecasting model pipeline. The data collection pipeline will run every 30 minutes, collecting and storing CPU utilization data for all RDS instances in an S3 bucket. The static threshold pipeline will run every 30 days to update the static threshold for all instances. Finally, the forecasting model pipeline will run every hour, predicting CPU utilization for the next 30 minutes for all RDS instances. In the forecasting process, the model will compare the predicted CPU utilization with the static threshold if it is available for the respective instance. For instances without a static threshold, the model will compare the forecasted value with a predefined threshold of 80%. If the predicted value exceeds the threshold, an alert will be triggered for the respective team, allowing them to take corrective measures and proactively reallocate CPU resources.
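A minimal sketch of the hourly forecasting step is shown below, using the SARIMAX implementation from statsmodels. The synthetic series, the SARIMA orders, and the seasonal period of 48 (one day of 30-minute samples) are illustrative assumptions, not tuned values; the 80% default threshold follows the pipeline description above.

```python
# A minimal sketch of the forecasting pipeline step: fit SARIMA on recent
# 30-minute CPU samples for one RDS instance and alert if the next forecast
# crosses the instance threshold. Orders and data are illustrative.
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Stand-in for data read from the S3 bucket: 7 days of 30-minute samples
# with a daily cycle (48 samples per day).
idx = pd.date_range("2024-01-01", periods=7 * 48, freq="30min")
rng = np.random.default_rng(4)
cpu = pd.Series(55 + 20 * np.sin(2 * np.pi * np.arange(len(idx)) / 48)
                + rng.normal(0, 3, len(idx)), index=idx)

model = SARIMAX(cpu, order=(1, 1, 1), seasonal_order=(1, 0, 1, 48))
result = model.fit(disp=False)

forecast = result.forecast(steps=1).iloc[0]   # next 30-minute window
threshold = 80.0          # static threshold if available, else 80% default
print(f"forecast next 30 min: {forecast:.1f}%")
if forecast > threshold:
    print(f"ALERT: forecast CPU {forecast:.1f}% exceeds {threshold}%")
```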
The pipeline below can be run on an MLOps tool to first calculate the dynamic threshold and then forecast database failures based on CPU usage data [2, 3].
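The paper does not prescribe the exact threshold computation, so the sketch below shows one plausible form for the 30-day threshold step: derive a per-instance threshold from the historical distribution of its own CPU usage, capped for safety. The mean-plus-three-standard-deviations rule and the 95% cap are illustrative choices.

```python
# A minimal sketch of the 30-day threshold pipeline step (one plausible
# form): a per-instance threshold from the instance's own CPU history.
import numpy as np
import pandas as pd

def static_threshold(cpu_30d: pd.Series, cap: float = 95.0) -> float:
    """Threshold = mean + 3 std over the last 30 days, capped for safety."""
    return float(min(cpu_30d.mean() + 3 * cpu_30d.std(), cap))

rng = np.random.default_rng(5)
cpu_30d = pd.Series(rng.normal(55, 8, 30 * 48))   # 30 days of 30-min samples
print(static_threshold(cpu_30d))                  # ~79% for this instance
```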
The results in the table below show the forecasted connection count and the actual database connection count during the model run.
The models successfully predicted failures using Observability data. This is significant, as teams can be alerted proactively and take countermeasures to protect the database and the application from failures. Database failures are all too common due to improper sizing, long-running jobs, logs filling up disk storage, and similar causes. ML models that predict these failures can prevent customers from being impacted.
This paper presented various ML techniques that can be applied to Observability data, along with a practical example of using a SARIMA model to predict DB failures. Organizations generate a lot of Observability data, but very few proactively put that data to use by applying machine learning models. As the DB forecasting example shows, ML models can help teams predict failures and run operations efficiently. MTTD and MTTR can be vastly improved with these models, and outages can be reduced in the first place. A limitation of these models is that they must be retrained frequently to keep the dynamic limits up to date as metrics change every minute. The models must also be performant and run quickly at low cost; fine-tuning them to run at scale, especially across thousands of databases at large enterprise organizations, can take several weeks.
Various tools, such as Elasticsearch and Datadog, provide built-in ML capabilities that non-ML engineers can configure quickly. As a future direction, AIOps can also be applied to Observability data, or to the alerts created by Observability tools, to correlate data across tools. ML should be part of every organization's strategy to operate teams efficiently, manage incidents, and innovate successfully.