Chapter 6B: Model and Operational Monitoring and Observability Metrics

Introduction

Monitoring machine learning (ML) models in production is essential to ensure that they continue to deliver operational value and perform reliably over time. As agencies increasingly deploy ML in mission-critical services, they must establish robust monitoring strategies that cover both functional and operational dimensions.

Key Questions for Metric Selection

When deciding what metrics to monitor, agencies should start by asking the following questions:

What are the critical success criteria for the model?
What business or policy goals does the model support?
What are the known risks or failure modes?
Are ground truth labels available in real-time or with delay?
What metrics can provide real-time actionable insights?

Best Practices for Metric Selection

Based on your answers to the questions above, define selection criteria that are in line with ML monitoring best practices. Some of these best practices include, but are not limited to:

Choosing metrics that are comparable across model versions.
Ensuring that metrics are simple, explainable, and interpretable.
Preferring metrics that can be collected in real time for immediate observability.
Selecting metrics that allow for clear alerting and response actions.

Challenges in Monitoring

When monitoring models in the production environment, it is common to encounter some of these issues:

Absence of ground truth (e.g., loan default labels that only arrive months later).
Noise in input data pipelines.
Complexity in defining thresholds for alerts.
Latency in feedback loops.

Functional vs. Operational Monitoring

There are generally two levels of monitoring that you should be aware of when monitoring your machine learning model in the production environment:

Functional Monitoring: Focuses on model behaviour, such as input data, model logic, and predictions.
Operational Monitoring: Focuses on infrastructure, such as system health, performance, and cost.

Functional Monitoring

Input-Level Functional Monitoring (Data)

Data Quality and Integrity (Data Testing)

Data quality and integrity issues mostly result from changes in the data pipeline, and tend to originate from upstream systems, data ingestion pipelines, or schema changes. To validate the integrity of production data before it reaches the model, we have to monitor certain metrics based on data properties. This way, if the input data is not what we expect, or not what we think the model expects, an alert can be triggered for the data team or service owner to investigate. Some examples include:

Missing Values: Missing income fields in a benefits eligibility model.
Schema Changes: Renaming or removing columns.
Out-of-range values: Negative ages, corrupted text data.

Some of the detection techniques that you can adopt to identify data quality and integrity issues include, but are not limited to:

Schema validation checks.
Duplicated data checks.
Null/missing value percentage checks.
Range and distribution checks.
Unit tests on ingestion pipelines.
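
The schema, null-percentage and range checks above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than a definitive implementation: the field names (age, income), the valid ranges and the 5% null threshold are hypothetical values chosen for the example.

```python
# Hypothetical expected fields and valid ranges; adjust to your own schema.
EXPECTED_RANGES = {
    "age": (0, 120),
    "income": (0.0, 1_000_000.0),
}
MAX_NULL_FRACTION = 0.05  # assumed alert threshold, not a recommended value


def validate_batch(rows):
    """Return a list of data quality issues found in a batch of input rows."""
    if not rows:
        return ["batch is empty"]

    issues = []
    for field, (low, high) in EXPECTED_RANGES.items():
        # Schema check: the field should be present in every row.
        absent = sum(1 for row in rows if field not in row)
        if absent:
            issues.append(f"{field}: missing from {absent} row(s)")

        # Null check: explicit nulls only, so absent fields are not double counted.
        present = [row[field] for row in rows if field in row]
        nulls = sum(1 for value in present if value is None)
        if nulls / len(rows) > MAX_NULL_FRACTION:
            issues.append(f"{field}: {nulls / len(rows):.0%} null values")

        # Range check on the non-null values.
        out_of_range = sum(
            1 for value in present
            if value is not None and not low <= value <= high
        )
        if out_of_range:
            issues.append(f"{field}: {out_of_range} value(s) outside [{low}, {high}]")
    return issues
```

An empty result means the batch passed every check; any returned message could be routed to the data team or service owner as an alert.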

When identifying data quality and integrity issues, some possible resolutions to adopt can include:

Alert data owners.
Enforce proper data validation practices by data owners.
Roll back to last known good schema.
Implement data validation gates.

Data/Feature Drift

Data drift refers to changes in the statistical distribution of input data features over time. Even if ground truth isn't immediately available, you can monitor the characteristics of incoming data and compare them to the training data baseline. Feature drift is essentially data drift examined on a per-feature basis – for instance, tracking if the distribution of "age" in a demographics model shifts significantly from the training distribution. Significant data/feature drift might warn that the model is receiving data that's qualitatively different from what it was trained on, potentially foreshadowing performance issues. Some examples include:

A fraud detection model receives more international transactions post-policy change.
Housing price pattern changes due to an economic crisis, affecting a resale price prediction model.

Some of the detection techniques that you can adopt to identify data/feature drift include, but are not limited to:

Time-windowed comparison of mean, variance, minimum and maximum values, and correlation scores.
For continuous features, divergence and distance measures such as Kullback-Leibler divergence, the Kolmogorov-Smirnov statistic, Population Stability Index, and Hellinger distance.
For categorical features, the chi-squared test, entropy, cardinality, or frequency of the feature's values.
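
Two of the measures listed above for continuous features, the Population Stability Index (PSI) and the two-sample Kolmogorov-Smirnov statistic, can be sketched with the standard library alone. The function names, the equal-width binning over the baseline range, the default of 10 bins and the small epsilon for empty bins are assumptions made for this example; production code would more likely rely on a tested statistics library.

```python
import math
from bisect import bisect_right


def population_stability_index(baseline, current, n_bins=10):
    """PSI between two samples, using equal-width bins over the baseline range."""
    low, high = min(baseline), max(baseline)
    width = (high - low) / n_bins or 1.0  # fall back to 1.0 for a constant baseline

    def bin_fractions(sample):
        counts = [0] * n_bins
        for x in sample:
            # Values outside the baseline range are clamped into the end bins.
            idx = int((x - low) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # A small floor avoids log(0) and division by zero for empty bins.
        return [max(c / len(sample), 1e-6) for c in counts]

    expected = bin_fractions(baseline)
    actual = bin_fractions(current)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )
```

Both functions return 0 for identical samples and grow as the production sample moves away from the training baseline. A commonly cited rule of thumb treats a PSI above roughly 0.25 as a significant shift, but alert thresholds should be tuned per feature.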

When identifying issues related to data/feature drift, some possible resolutions to adopt can include:

Trigger an alert to kick off a retraining job on production data, then test the retrained model using shadow deployment or A/B testing.
Use adaptive learning techniques such as combining new data with historical data, and assigning higher weights to the features that drifted significantly.

Outliers

Outliers are rare events or data points that fall outside the expected distribution. They can affect model performance because the dataset contains too few similar examples for the model to learn a reliable structure from, so the model may return an unreliable response for them. Some examples include:

Extremely high transaction amounts in benefit fraud models.

Some of the detection techniques that you can adopt to identify outliers include, but are not limited to:

Z-score, Isolation Forest, One-Class SVM.
Performing statistical distance tests on recent events to detect out-of-distribution issues.
Performing distribution tests to determine how far off features from the production environment are from the features in the training set.
Unsupervised learning methods to categorise model inputs and predictions, discovering anomalous examples and predictions.
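
As an illustration of the simplest technique listed above, the following sketch flags incoming values by their z-score against a baseline sample. The function name and the threshold of 3 standard deviations are assumptions for this example; the z-score also assumes the feature is roughly normally distributed, which should be checked before relying on it.

```python
import statistics


def zscore_outliers(baseline, new_values, threshold=3.0):
    """Return the new values that lie more than `threshold` standard
    deviations away from the baseline mean."""
    mean = statistics.fmean(baseline)
    stdev = statistics.stdev(baseline)  # sample standard deviation
    if stdev == 0:
        # A constant baseline: anything different is treated as an outlier.
        return [value for value in new_values if value != mean]
    return [value for value in new_values if abs(value - mean) / stdev > threshold]
```

Flagged values could then be routed to human review rather than scored automatically.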

When identifying issues related to outliers, some possible resolutions to adopt can include:

Flag for human review.
Train a new challenger model on production data that better represents the outliers.
Document the issue and develop troubleshooting steps for addressing similar issues in the future.
Build fallback mechanisms or fail-safes.

Model-Level Functional Monitoring

Model/Concept Drift

Model drift, or concept drift, happens when the relationship the model learned between features and labels (or, for unsupervised learning solutions, the patterns among the features) no longer holds because it has changed over time. In other words, the model's predictive function no longer reflects reality because the “concept” the model learned has shifted. An example is a recommendation model trained on last year's user behaviour that now performs poorly because user preferences have changed this year. Concept drift is often detected by looking at output metrics: for example, a significant change in the distribution of prediction scores or classes (sometimes called prediction drift), or actual outcomes (ground truth) that show the model’s predictions are consistently off-target. Monitoring concept drift might involve evaluating model performance on recent data (if labels can be obtained later) or using proxy metrics, such as a drop in user engagement for a recommendation model, which might imply that the recommendations are less relevant. Some examples include:

Public sentiment shifts over time in social services feedback.
A tax classification model no longer identifies recent scam patterns.

Some of the detection techniques that you can adopt to identify model drift include, but are not limited to:

Rolling evaluation on labelled data.
Monitor metrics like accuracy, precision, recall, F1, RMSE (if ground truth labels are available).
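
Rolling evaluation can be sketched as a fixed-size window over the most recent labelled predictions. The class name, the window size and the alert threshold below are hypothetical; accuracy is used only because it is the simplest metric to show, and the same pattern applies to precision, recall, F1 or RMSE.

```python
from collections import deque


class RollingAccuracy:
    """Tracks accuracy over the most recent `window_size` labelled predictions."""

    def __init__(self, window_size=500, alert_below=0.8):
        self.outcomes = deque(maxlen=window_size)  # oldest entries drop off
        self.alert_below = alert_below

    def record(self, prediction, label):
        """Store whether a prediction matched its (possibly delayed) label."""
        self.outcomes.append(prediction == label)

    @property
    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self):
        # Only alert once the window is full, to avoid noisy early readings.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy < self.alert_below
```

Calling record() whenever a ground truth label arrives keeps the window current, and should_alert() can feed whatever alerting channel the agency already uses.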

When identifying issues related to concept/model drift, some possible resolutions to adopt can include:

Regularly retrain on newer data.
Remodel or redevelop the model from scratch to better capture shifts in the problem.
Consider online learning algorithms to automate the model updating process.

Model Configuration and Artifacts

The model configuration file and artifacts contain all the components that were used to build that model. Some of these components that are useful to monitor over the lifecycle of a deployed machine learning model include:

Training and evaluation datasets and versions
Hyperparameters
Dependency versions
Feature mappings
Environment variables
Code base for model processing, training, and testing

When identifying issues related to model configuration and artifacts, some possible resolutions to adopt can include:

Use version control for config files.
Validate configs pre-deployment.
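
A pre-deployment validation step might look like the sketch below, which checks that required keys are present and computes a fingerprint of the configuration so that unexpected changes can be detected. The required key names and both function names are hypothetical and should be replaced with whatever your own configuration format defines.

```python
import hashlib
import json

# Hypothetical set of keys that every model config is expected to contain.
REQUIRED_KEYS = {
    "model_version",
    "training_data_version",
    "hyperparameters",
    "dependencies",
}


def validate_config(config):
    """Return a list of problems; an empty list means the config passed."""
    missing = sorted(REQUIRED_KEYS - set(config))
    return [f"missing required key: {key}" for key in missing]


def config_fingerprint(config):
    """SHA-256 of a canonical JSON form, independent of key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Storing the fingerprint alongside the deployed model makes it possible to compare the configuration running in production with the one that was reviewed and versioned.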

Model Versions

Monitoring model versions in production is critical because you want to be sure that the right version is deployed. Some of the best practices for monitoring model versions include:

Log all versions to a metadata store.
Tag predictions with model version.
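
Tagging predictions with the model version can be as simple as adding a field to each logged record. The sketch below assumes a JSON log line per prediction; the record fields and function names are illustrative and not tied to any particular metadata store.

```python
import json
from collections import Counter
from datetime import datetime, timezone


def build_prediction_record(model_version, request_id, prediction):
    """Build a log record that ties a prediction to the model that made it."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "request_id": request_id,
        "prediction": prediction,
    }


def to_log_line(record):
    """Serialise a record as one JSON line for a log or metadata store."""
    return json.dumps(record, sort_keys=True)


def unexpected_versions(records, expected_version):
    """Count predictions served by any version other than the expected one."""
    counts = Counter(record["model_version"] for record in records)
    return {v: n for v, n in counts.items() if v != expected_version}
```

A non-empty result from unexpected_versions() suggests that an old or unintended model version is still serving traffic.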

Adversarial Attacks

Every organisation faces security threats, and government agencies are no different. As machine learning applications increasingly become the central decision systems of companies and agencies, you have to be concerned about the security of your model in production, as it is susceptible to adversarial attacks. Some examples include:

Adversarial attacks on image recognition in surveillance.
Fraudsters trying to fool a model for detecting suspicious transactions on a government website.

Some of the detection techniques that you can adopt to identify adversarial attacks include, but are not limited to:

Anomaly detection.
Input signature analysis.

When identifying issues related to adversarial attacks, some possible resolutions to adopt can include:

Use human-in-the-loop review.
Apply defenses from Adversarial Robustness Toolbox by Trusted AI.

Output-Level Monitoring (Predictions)

Monitoring model output in production is not only the best indicator of model performance; it also tells us whether the agency's KPIs are being met.

Model Evaluation Metrics

Using metrics to evaluate model performance is a big part of monitoring your model in production. Different metrics can be used here depending on your problem (e.g., classification, regression, clustering, anomaly detection, etc.) and whether you have a ground truth/label to compare your model's predictions with:

Classification: Accuracy, Precision, Recall, F1-score, Receiver Operating Characteristic Curve, Precision-Recall Curve
Regression: Mean Squared Error, Root Mean Squared Error, Mean Absolute Error, Mean Absolute Percentage Error, R-Squared
Anomaly Detection: Precision@k, Recall@k, F1@k, False Positive Rate, AUROC / AUPRC for Anomaly Scores
Bias and Fairness: Disparity metrics across demographics

You may refer to Chapter 5D: Model Development Evaluation for a more comprehensive discussion about these metrics and their mathematical formulations.
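
For monitoring purposes, the classification metrics can be recomputed on each batch of labelled production data. The following is a minimal sketch for the binary case; the function name and the convention of returning 0.0 when a denominator is zero are assumptions for this example.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    # Assumed convention: report 0.0 when a metric is undefined.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Logging these values per model version and per time window makes them comparable across versions, in line with the metric selection practices described earlier in this chapter.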

Prediction Drift

When ground truth is not available, the distribution of prediction results can serve as a performance proxy that stays in line with the agency's KPIs. A model evaluation store holds the response of the model (a signature of model decisions) to every piece of input data for every model version, in every environment. This way, you’ll be able to monitor model predictions over time and compare their distributions using statistical metrics.

Some of the detection techniques that you can adopt to identify prediction drift include, but are not limited to:

Kullback-Leibler divergence
Kolmogorov-Smirnov statistics
Population Stability Index
Hellinger distance
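
As one example of the measures above, the Hellinger distance between the predicted class distributions of two time windows can be sketched as follows. The function names are hypothetical, and the example assumes a classifier with discrete predicted labels; for continuous scores the predictions would first need to be binned.

```python
import math
from collections import Counter


def class_distribution(predictions):
    """Fraction of predictions falling into each predicted class."""
    counts = Counter(predictions)
    return {label: n / len(predictions) for label, n in counts.items()}


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions (0 = identical,
    1 = no overlap). Classes absent from one side count as probability 0."""
    labels = set(p) | set(q)
    squared = sum(
        (math.sqrt(p.get(label, 0.0)) - math.sqrt(q.get(label, 0.0))) ** 2
        for label in labels
    )
    return math.sqrt(squared / 2)
```

Comparing a reference window (for example, predictions made shortly after deployment) with the most recent window gives a single number that can be tracked and alerted on without waiting for labels.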

When identifying issues related to prediction drift, some possible resolutions to adopt can include:

Use as early warning for model drift.
Combine with feature drift analysis.

Operational Monitoring

Monitoring at the operations and system level is primarily the responsibility of the IT, DevOps, or ML Engineering team, but it should also be a shared responsibility between the Data Scientist and the Ops team. When things go wrong at this level, alerts are typically forwarded to the Ops team to act on; however, you might also be involved in resolving the issue, depending on the alert raised. At this level, you’re mostly monitoring the resources your model runs on in production and making sure that they’re healthy. Pipeline health, system performance metrics (e.g., I/O, disk utilisation, memory and CPU usage, traffic), and cost are some of the key areas you might want to consider monitoring.

System Performance Metrics

You need to be aware of certain metrics that indicate how your model performs within the entire application stack. If your model has high latency in returning predictions, that is bound to affect the overall speed of the system. Some of the system performance metrics to monitor for your infrastructure include:

CPU/GPU utilisation
Memory consumption
Disk I/O
API latency (mean, p95, p99)
Throughput (requests/sec)
Failure rate (4xx/5xx errors)
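
The latency and failure-rate metrics above can be derived from request logs. The sketch below assumes each log entry is a dictionary with hypothetical latency_ms and status fields, and uses the nearest-rank definition of a percentile; monitoring platforms may use a different interpolation method and report slightly different values.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of
    the observations at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarise_requests(requests):
    """Summarise latency and failure rate for a non-empty list of requests."""
    latencies = [r["latency_ms"] for r in requests]
    failures = sum(1 for r in requests if r["status"] >= 400)  # 4xx and 5xx
    return {
        "mean_ms": sum(latencies) / len(latencies),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "failure_rate": failures / len(requests),
    }
```

Tracking p95 and p99 alongside the mean matters because a small share of very slow requests can be hidden by an otherwise healthy average.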

When identifying issues related to system performance, some possible resolutions to adopt can include:

Auto-scale compute resources.
Optimise model code or caching.

System Reliability

System reliability involves monitoring the infrastructure and ensuring consistent availability and network uptime. Some of the metrics for monitoring system reliability include:

Cluster/node health
Deployment success/failure
Network latency and connectivity

Some of the detection techniques that you can adopt to monitor system reliability include, but are not limited to:

Heartbeat checks
Redundancy and failover mechanisms
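
A heartbeat check reduces to comparing the time each service last reported in against a timeout. In this sketch, timestamps are assumed to be seconds (for example, from time.time()), and the function name and the 30-second default timeout are illustrative.

```python
def stale_services(last_heartbeat, now, timeout_seconds=30.0):
    """Return the names of services whose last heartbeat is older than the
    timeout. `last_heartbeat` maps service name -> timestamp in seconds."""
    return sorted(
        name
        for name, seen_at in last_heartbeat.items()
        if now - seen_at > timeout_seconds
    )
```

Any service returned here is a candidate for an alert or for triggering a failover.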

When identifying issues related to system reliability, some possible resolutions to adopt can include:

Automate failovers
Enable chaos testing

Data Pipelines

Monitoring the health of data pipelines is crucial because data quality issues can arise from bad or unhealthy data pipelines. This is especially tricky for your IT Ops/DevOps team to monitor, and may require empowering your Data Engineering/DataOps team to monitor and troubleshoot issues. Some of the metrics for monitoring data pipelines include:

DAG/task success rates
Data freshness
Schema evolution
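
Task success rate and data freshness can both be computed from metadata most orchestrators already record. The sketch below assumes run statuses are available as strings and that the last successful update time is known; the status value "success", the function names and the 24-hour freshness limit are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def task_success_rate(run_statuses):
    """Fraction of task runs that succeeded, or None if there were no runs."""
    if not run_statuses:
        return None
    return sum(1 for s in run_statuses if s == "success") / len(run_statuses)


def is_stale(last_updated, now=None, max_age=timedelta(hours=24)):
    """True if the dataset has not been refreshed within `max_age`."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_updated > max_age
```

Both values are cheap to compute on a schedule and give early warning before stale or partial data reaches the model.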

When identifying issues within your data pipelines, some possible resolutions to adopt can include:

Data validation gates
Retry mechanisms and data versioning

Model Pipelines

You should also track crucial factors that can cause your model to break in production after it has been retrained and redeployed. Some of these factors include:

Job runtime and status
Dependency mismatches
Deployment rollbacks

When identifying issues within your model pipeline, some possible resolutions to adopt can include:

Use CI/CD tools like GitLab CI/CD
Log pipeline metadata

Cost Monitoring

You need to keep an eye on how much it’s costing you and your agency to host your entire machine learning application, including data storage and compute costs, retraining, and other types of orchestrated jobs. These costs can add up fast, especially if they’re not being tracked. It also takes computational power for your models to make predictions for every request, so you need to track inference costs as well. Some of these costs to monitor include:

Compute/inference cost per request
Storage usage
Cloud spend per model/service

When identifying issues with monitoring the cost of your projects, some possible resolutions to adopt can include:

Set budgets and alerts
Optimise model complexity
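
Cost per request and a simple budget alert can be computed from billing and traffic figures. The sketch below is illustrative only: the function names, the status labels and the 80% warning level are assumptions, and real figures would come from your cloud provider's billing data.

```python
def cost_per_request(hourly_compute_cost, requests_per_hour):
    """Average inference cost of a single request."""
    if requests_per_hour <= 0:
        raise ValueError("requests_per_hour must be positive")
    return hourly_compute_cost / requests_per_hour


def budget_status(month_to_date_spend, monthly_budget, alert_fraction=0.8):
    """Classify spend against a budget as 'ok', 'warning' or 'over_budget'."""
    used = month_to_date_spend / monthly_budget
    if used >= 1.0:
        return "over_budget"
    if used >= alert_fraction:
        return "warning"
    return "ok"
```

Tracking cost per request over time also helps show whether optimising model complexity is actually reducing spend.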

Concluding Remarks

Model observability is critical for sustainable and secure ML deployments in government settings. By proactively monitoring both functional and operational metrics, agencies can ensure high model performance, system reliability, and cost-effectiveness.

For a more detailed guide on monitoring ML models in production, visit: https://neptune.ai/blog/how-to-monitor-your-models-in-production-guide