Chapter 6D: Prometheus and Grafana for Model Monitoring and Observability

Introduction

Machine learning models in production require robust monitoring and observability to ensure they perform reliably and fairly over time. Prometheus and Grafana are an industry-standard open-source pairing for metrics-based monitoring and dashboarding, widely adopted across sectors from tech companies to government agencies. This chapter gives an overview of both tools in the context of MLOps. We first cover the fundamentals of Prometheus and of Grafana, then discuss how the two work together to monitor ML systems, and finish with best practices and common pitfalls.

Fundamentals of Prometheus

Prometheus is an open-source monitoring and alerting toolkit originally developed at SoundCloud and now part of the Cloud Native Computing Foundation. It excels at capturing and querying time-series metrics – numeric measurements recorded over time – from applications and infrastructure. Prometheus has been designed with a focus on reliability and simplicity in cloud-native, containerised environments. Let’s explore its key features, data model, metrics types, query language, and mechanisms for data collection.

Key Features

Prometheus provides several distinctive features for metrics monitoring. It uses a multi-dimensional data model where time-series are identified by a metric name plus arbitrary key-value labels (tags). This allows slicing and dicing metrics along different dimensions (e.g. by model version, data centre, user segment, etc.) when querying. Prometheus offers a powerful query language called PromQL for on-the-fly aggregation and analysis of these metrics. It operates as a standalone time-series database on each server (no external distributed storage needed), meaning each Prometheus server is autonomous and continues functioning even if network storage is unavailable. Metrics collection in Prometheus happens via a pull model over HTTP – the Prometheus server periodically scrapes each target endpoint for metrics. Prometheus can discover targets dynamically via integrations with service discovery systems or Kubernetes, as well as static configuration. Finally, Prometheus integrates with various visualisation and alerting tools; for example, it can feed metrics to Grafana dashboards, which will be covered later in this chapter.
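To make the multi-dimensional data model concrete, the following is a minimal Python sketch, not Prometheus code. It stores a few samples keyed by metric name plus labels, then filters and aggregates them by label, which is roughly what a PromQL label matcher and a sum by (...) do. The metric name inference_requests_total and the labels dc and model_version are hypothetical.

```python
from collections import defaultdict

# Each series is identified by a metric name plus a set of key-value labels.
# The metric and label names below are hypothetical examples.
samples = {
    ("inference_requests_total", (("dc", "eu"), ("model_version", "v1"))): 120.0,
    ("inference_requests_total", (("dc", "eu"), ("model_version", "v2"))): 80.0,
    ("inference_requests_total", (("dc", "us"), ("model_version", "v2"))): 50.0,
}

def select(name, **matchers):
    """Return the series with this name whose labels equal every matcher."""
    selected = {}
    for (metric, labels), value in samples.items():
        label_map = dict(labels)
        if metric == name and all(label_map.get(k) == v for k, v in matchers.items()):
            selected[(metric, labels)] = value
    return selected

def sum_by(series, label):
    """Aggregate series values grouped by one label, like PromQL's sum by (label)."""
    totals = defaultdict(float)
    for (_, labels), value in series.items():
        totals[dict(labels).get(label, "")] += value
    return dict(totals)

eu_only = select("inference_requests_total", dc="eu")
per_version = sum_by(select("inference_requests_total"), "model_version")
```

Here eu_only holds the two series from the eu data centre, and per_version collapses the data-centre dimension so that each model version has a single total.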

Pull vs. Push Model

One of Prometheus’s core design decisions is using a pull-based data collection model. In Prometheus, each monitored service (whether it’s a web app, a database, or an ML model serving endpoint) exposes an HTTP endpoint where current metrics are reported in a simple text format. The Prometheus server then scrapes these endpoints at a configurable interval. This approach contrasts with push-based systems where agents on the service push metrics to a central server.
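As an illustration of the simple text format mentioned above, here is a small sketch that renders one metric the way a /metrics endpoint would present it: a HELP line, a TYPE line, and one line per labelled sample. The function render_metric and the metric predictions_total are hypothetical, and the sketch skips details a real client library handles, such as escaping label values.

```python
def render_metric(name, metric_type, help_text, samples):
    """Render one metric family in the Prometheus text exposition style.

    samples is a list of (labels_dict, value) pairs.
    Label values are not escaped here; a real client library does that.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

body = render_metric(
    "predictions_total",
    "counter",
    "Total predictions served.",
    [({"model": "churn"}, 42)],
)
```

In practice you would rely on an official client library to produce this output rather than formatting it by hand.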

The pull model has several practical advantages. It naturally detects if a target is down – if the Prometheus server cannot reach the endpoint, it knows the service is unavailable. In a push model, a silent target might go unnoticed for longer. Pulling also means you can easily run ad-hoc Prometheus instances (for example, on a developer’s laptop) to scrape metrics without reconfiguring the targets to send data. It centralises control of scraping intervals and parameters at the server, rather than requiring all applications to have logic for pushing metrics (which can simplify client code). Additionally, pull mode avoids the risk of “overwhelming” the aggregator – with push, a misconfigured client could flood the server; with pull, the server controls the rate. On the downside, pull requires that the server knows where all targets are (hence the need for service discovery) and that the server can reach them. In secured environments or across network boundaries, this may require additional configuration (Prometheus can scrape through proxies or you might co-locate Prometheus within each cluster to reach local metrics).
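The down-detection property of the pull model can be sketched in a few lines. In the sketch below, which is an illustration rather than how Prometheus is implemented, the server side loops over its targets and records up=1 or up=0 depending on whether the fetch succeeded (Prometheus exposes a comparable synthetic metric called up). The target addresses and the fake_fetch function are hypothetical stand-ins for real HTTP requests.

```python
def scrape_once(targets, fetch):
    """Scrape every target once; a failed fetch is recorded as up=0."""
    results = {}
    for target in targets:
        try:
            results[target] = {"up": 1, "body": fetch(target)}
        except OSError:
            # Connection errors are subclasses of OSError.
            results[target] = {"up": 0, "body": None}
    return results

def fake_fetch(target):
    """Hypothetical stand-in for an HTTP GET against the target's /metrics."""
    if target == "model-b:8000":
        raise ConnectionRefusedError("target unreachable")
    return "predictions_total 42\n"

results = scrape_once(["model-a:8000", "model-b:8000"], fake_fetch)
```

Because the server initiates every request, an unreachable target is noticed on the very next scrape, and the scrape rate stays under the server's control.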

Under the hood, Prometheus stores all data as time-series: a sequence of timestamped samples (metric values) identified by a unique metric name and label set. Each unique combination of labels produces a separate time-series in the database. Prometheus’s storage is optimised for this data model: it uses an append-only log and periodic compaction to efficiently record new samples. Samples are indexed by metric name and labels for fast retrieval. The storage engine writes incoming samples to a write-ahead log (WAL) on disk and into an in-memory buffer, and periodically compacts them into on-disk blocks that each cover two hours. The WAL can be compressed to reduce disk I/O; enabling WAL compression can roughly halve its size with minimal CPU overhead. Prometheus’s on-disk footprint is very efficient, using on average only 1–2 bytes per sample due to compression. This means a very high sample rate can be stored given sufficient disk, and high-cardinality metrics (lots of unique series) are more of a scaling concern than raw sample frequency, as we’ll discuss later. If you need to keep data longer or at massive scale, Prometheus also supports remote storage integrations: it can forward samples to external long-term stores or cloud monitoring services for retention and querying. In essence, Prometheus’s pull model and time-series database make it a self-contained, resilient system well suited to monitoring dynamic environments. You get immediate insight into whether your services (or ML models) are up and what their key metrics are doing, without heavy dependencies.
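A rough back-of-the-envelope calculation shows why cardinality matters more than sample frequency. The sketch below multiplies the number of values per label to get a worst-case series count, then estimates disk usage as retention time × samples per second × bytes per sample. The label names, their cardinalities, and the 2 bytes per sample figure are illustrative assumptions taken from the upper end of the range quoted above.

```python
import math

def series_count(label_cardinalities):
    """Worst-case number of series: one per combination of label values."""
    return math.prod(label_cardinalities.values())

def disk_bytes(series, scrape_interval_s, retention_days, bytes_per_sample=2.0):
    """Estimate disk use as retention seconds x samples per second x bytes per sample."""
    samples_per_second = series / scrape_interval_s
    return retention_days * 86400 * samples_per_second * bytes_per_sample

# Hypothetical labels on one metric.
modest = series_count({"model_version": 5, "dc": 3, "endpoint": 4})
# Adding a per-user label multiplies the series count by the number of users.
explosive = series_count({"model_version": 5, "dc": 3, "endpoint": 4, "user_id": 100_000})

modest_disk = disk_bytes(modest, scrape_interval_s=15, retention_days=15)
explosive_disk = disk_bytes(explosive, scrape_interval_s=15, retention_days=15)
```

Under these assumptions the modest metric needs about 10 MB for 15 days, while the same metric with a user_id label needs roughly a terabyte, which is why unbounded identifiers are usually kept out of labels.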

Metrics Taxonomy

Prometheus client libraries support four core metric types for instrumentation: Counter, Gauge, Histogram, and Summary. All of these ultimately get recorded as time-series in Prometheus (the server flattens them to numeric series), but they represent different semantic meanings.

A Counter is a cumulative metric that only increases (or resets). Use counters for quantities that monotonically increase, such as total requests served or number of errors encountered. For example, requests_total could be a counter that starts at 0 and increments for each request. If the process restarts, the counter may reset to 0, but it should never decrease during a single run. Counters are useful combined with PromQL functions like rate() or increase() to calculate rates over time (e.g., requests per second) or deltas. A best practice is not to use a Counter for values that can go down (like current concurrent users – that should be a Gauge).
A Gauge is a metric that represents a value that can go up and down arbitrarily. Use gauges for things like current memory usage, queue length, or any measurement that varies in both directions. Gauges sample the current state. For ML examples, you might use a gauge for “current number of active prediction jobs” or “GPU usage percent” or a real-time model score output. Gauges are straightforward but be careful if the value resets or has discontinuities (Prometheus assumes each sample is the current value at that time).
A Histogram samples observations into buckets, and is typically used to track distributions like request latency or response sizes. When you define a histogram (with a given name and bucket ranges), the Prometheus client actually exposes multiple metrics: a counter for each bucket (cumulative count of observations ≤ that bucket boundary), as well as the total observed value sum and count. Histograms are very useful for aggregating latency distributions across services or over time, which is important in SLA/SLO monitoring. One should choose bucket ranges carefully to cover the expected distribution of values.
A Summary is similar to a histogram in that it tracks a distribution, but it calculates quantiles on the client side over a sliding time window. Summaries are straightforward for getting quantiles per instance, but they don’t aggregate easily (you generally cannot average or sum quantiles from multiple instances meaningfully). A summary could, for example, track quantiles of model prediction confidence when you only need them per instance.
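The semantics of the first three types can be sketched with a few small classes. These are simplified stand-ins written for illustration, not the API of any Prometheus client library. Summary is omitted because its sliding-window quantile estimation is more involved. Note how the histogram's bucket counts are cumulative: an observation increments every bucket whose upper bound is at least the observed value.

```python
class Counter:
    """Cumulative value that may only increase."""
    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount

class Gauge:
    """Current value that may go up or down."""
    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = float(value)

class Histogram:
    """Cumulative bucket counts plus the sum and count of observations."""
    def __init__(self, bounds):
        self.bounds = sorted(bounds) + [float("inf")]
        self.bucket_counts = [0] * len(self.bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        self.sum += value
        self.count += 1
        for i, bound in enumerate(self.bounds):
            if value <= bound:
                self.bucket_counts[i] += 1

# Hypothetical latency observations in seconds.
latency = Histogram([0.1, 0.3, 1.0])
for seconds in (0.05, 0.2, 0.25, 2.0):
    latency.observe(seconds)
```

After these four observations the buckets for 0.1, 0.3, 1.0 and +Inf hold 1, 3, 3 and 4 respectively, so the +Inf bucket always equals the total count.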

Prometheus metrics are exposed in a simple text format. By convention, a metric name ends with a suffix that indicates its type or unit (e.g. _total for counters, _seconds for timings). This helps other users immediately grasp what the metric measures. Prometheus itself doesn’t require different handling of metric types at query time (all become numeric time-series), but knowing the type guides how you use them (e.g., apply rate() only to counters, not to gauges that can go down). The Prometheus docs and community strongly emphasise careful metric naming and labelling. For example, avoid embedding high-cardinality information in metric names or labels (next section), and follow the conventions for units. Using the right metric type for each measurement in your ML pipeline will ensure the data is both correct and efficient to query – e.g., tracking “predictions made” as a counter makes it easy to alert if the rate of predictions drops to zero unexpectedly (meaning perhaps an outage in the model serving).

PromQL Basics and Relevance for ML Engineers

PromQL (Prometheus Query Language) is the functional query language used to retrieve and aggregate metrics from Prometheus. PromQL allows you to filter time-series by labels, apply aggregations (sum, avg, min, max, count), apply functions (rates, quantiles, derivatives, etc.), and even join metrics together in expressions. PromQL expressions can be used in Grafana graphs, alerting rules, or API calls to extract meaningful insights.
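To show what a rate over a counter means, the following Python sketch computes an increase and a per-second rate from raw counter samples, treating any drop in value as a counter reset. It is a simplified illustration: the real PromQL rate() also extrapolates to the edges of the query window, which is not modelled here. The sample values are made up.

```python
def increase(samples):
    """Total increase of a counter; samples are (timestamp_s, value), oldest first."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the process restarted and the counter began again from 0.
        total += cur if cur < prev else cur - prev
    return total

def rate(samples):
    """Average per-second increase between the first and last sample."""
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    return increase(samples) / elapsed

# Hypothetical scrapes of a request counter every 15 seconds, with one restart.
scrapes = [(0, 100.0), (15, 130.0), (30, 10.0), (45, 40.0)]
total_increase = increase(scrapes)
per_second = rate(scrapes)
```

The restart between the second and third scrape does not produce a negative rate, because the post-reset value is counted as new increase.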

PromQL is central to how Prometheus enables alerting and visualisation. Alert rules are essentially PromQL expressions that trigger when they return a non-empty result (often a value above a threshold). Grafana also relies on PromQL for querying Prometheus data for dashboards. For ML engineers, understanding PromQL means you can self-serve a lot of monitoring logic: you can define custom metrics for your ML pipeline (e.g., data ingestion rate, feature skew, model accuracy on recent data) and then write PromQL queries to track them and alert on anomalies. It also supports arithmetic and comparison operators to combine metrics.

For ML monitoring, PromQL lets you do things like define SLOs (Service Level Objectives) and check them. For instance, you could have a metric for inference latency histogram and use PromQL to compute the percentage of requests under 300ms in the last 10 minutes; if it falls below some target, trigger an alert. Or use a combination of metrics to detect data drift – e.g., if the distribution of a feature value (exported as a histogram) deviates significantly from a baseline, a PromQL query could calculate the divergence. In practice, ML engineers don’t need to memorise every function, but should be comfortable reading and writing moderate complexity PromQL queries to harness the full power of Prometheus. The Prometheus UI and Grafana Explore allow experimenting with queries interactively. Grafana’s alerting now supports multidimensional alerts, meaning you can have a single PromQL alert rule that generates alerts labelled by model name or data centre, etc., based on the series that breach a condition.
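The latency SLO check described above can be sketched from cumulative histogram buckets. In PromQL, the fraction of requests under 300ms would typically be written along the lines of sum(rate(inference_latency_seconds_bucket{le="0.3"}[10m])) / sum(rate(inference_latency_seconds_count[10m])), where the metric name is hypothetical. The Python below reproduces the arithmetic on made-up bucket counts and adds a quantile estimate using linear interpolation within a bucket, which is the general approach histogram_quantile() takes; edge cases are simplified.

```python
INF = float("inf")

def fraction_within(buckets, threshold):
    """buckets: list of (upper_bound, cumulative_count) ending with +Inf."""
    total = buckets[-1][1]
    for bound, count in buckets:
        if bound == threshold:
            return count / total
    raise ValueError("threshold must be one of the bucket boundaries")

def histogram_quantile(q, buckets):
    """Estimate quantile q (0 < q <= 1) by interpolating inside the target bucket."""
    total = buckets[-1][1]
    if total == 0:
        return None
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == INF:
                return prev_bound  # cannot interpolate into the +Inf bucket
            share = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * share
        prev_bound, prev_count = bound, count

# Hypothetical cumulative counts observed over the last 10 minutes.
buckets = [(0.1, 50), (0.3, 90), (1.0, 98), (INF, 100)]
under_300ms = fraction_within(buckets, 0.3)
p95 = histogram_quantile(0.95, buckets)
slo_met = under_300ms >= 0.95  # the 95% target is an assumed example
```

With these numbers 90% of requests finish within 300ms, so a 95% target would be missed and an alert rule built on the same expression would fire.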

Why does PromQL matter specifically for ML engineers? Because it enables custom monitoring beyond basic system metrics. Traditional Ops might focus on CPU, memory, request throughput. In ML, we care about things like model accuracy over time, feature statistics, input data volume, training duration, etc. By instrumenting those as metrics (counters, gauges, histograms) and using PromQL, an ML engineer can create tailored dashboards and alerts – for example: Alert if the daily average prediction confidence for model X falls below Y (potential model drift), or Alert if the distribution of predictions skews heavily compared to last week (possible data shift). PromQL can express these kinds of conditions by comparing current metrics to historical ranges or other metrics (via vector matching on labels).
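One way to quantify "the distribution of predictions skews compared to last week" is to compare bucket counts from the two periods. The sketch below uses the population stability index (PSI) as the divergence measure. This is an assumption chosen for illustration; the text does not prescribe a particular measure. The bucket counts and the 0.25 threshold are likewise made up. In a real setup the two sets of counts could come from the same histogram metric queried now and with a one-week offset.

```python
import math

def to_proportions(counts, eps=1e-6):
    """Convert per-bucket counts to proportions, avoiding zeros for the logarithm."""
    total = sum(counts)
    return [max(c / total, eps) for c in counts]

def psi(baseline_counts, current_counts):
    """Population stability index between two bucketed distributions."""
    p = to_proportions(baseline_counts)
    q = to_proportions(current_counts)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))

# Hypothetical per-bucket prediction counts (not cumulative).
last_week = [50, 30, 20]
this_week = [20, 30, 50]

drift_score = psi(last_week, this_week)
drift_alert = drift_score > 0.25  # illustrative threshold, tune for your data
```

Identical distributions give a score of zero, and the score grows as mass moves between buckets, so it can be exported as a gauge and alerted on like any other metric.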

In summary, PromQL is the query glue that makes Prometheus metrics actionable. It allows on-the-fly analysis of your ML system’s telemetry. Mastering even the basics (like filtering by labels, using rate(), sum by(), max, etc.) goes a long way for creating insightful monitoring dashboards and alerts for ML pipelines. It’s the “language of observability” for metrics that an ML team can use to understand their models in production.

Service Discovery and Scraping Mechanisms

Prometheus uses service discovery (SD) to find the targets it needs to scrape. Rather than manually maintaining a static list of endpoints (which is possible but tedious in dynamic environments), Prometheus integrates with many systems to automatically fetch lists of services. For example, in a Kubernetes cluster, Prometheus can query the Kubernetes API to find all Pods with certain annotations or all Services exposing a /metrics port. Other common service discovery mechanisms include AWS EC2 APIs, Azure or GCP service APIs, and more. There’s also file-based SD, where Prometheus watches a file for target lists (useful for custom integration), and DNS-based SD.

When Prometheus finds targets via service discovery, it will dynamically add or remove them from its scraping loop. This is crucial in environments like Kubernetes where pods (and thus endpoints) come and go frequently. It ensures new microservices or auto-scaled instances of an ML model are picked up automatically without human intervention. As an example, if you deploy 10 instances of a model serving service in a Kubernetes namespace, Prometheus (if configured with the proper selectors) will detect those 10 pods and scrape each one’s metrics endpoint, adding labels like pod name, namespace, etc., to distinguish their metrics.
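The add-and-remove behaviour can be sketched as a reconciliation between the targets currently being scraped and the targets just discovered. This is an illustration of the idea only, not Prometheus's implementation. The pod records, the metrics label, and the port are hypothetical.

```python
def discover(pods, label, value, port=8000):
    """Return scrape addresses for pods carrying the given label value."""
    return {f"{pod['ip']}:{port}" for pod in pods if pod["labels"].get(label) == value}

def reconcile(current_targets, discovered_targets):
    """Work out which targets to start scraping, stop scraping, and keep."""
    current, discovered = set(current_targets), set(discovered_targets)
    return {
        "add": sorted(discovered - current),
        "remove": sorted(current - discovered),
        "keep": sorted(current & discovered),
    }

# Hypothetical result of asking the orchestrator for running pods.
pods = [
    {"name": "model-1", "ip": "10.0.0.1", "labels": {"metrics": "true"}},
    {"name": "model-2", "ip": "10.0.0.2", "labels": {"metrics": "true"}},
    {"name": "batch-1", "ip": "10.0.0.9", "labels": {}},
]
currently_scraped = {"10.0.0.1:8000", "10.0.0.5:8000"}
plan = reconcile(currently_scraped, discover(pods, "metrics", "true"))
```

In this example the new pod at 10.0.0.2 is added, the vanished pod at 10.0.0.5 is dropped, and the unlabelled batch pod is never scraped.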

Scraping is the process where Prometheus sends an HTTP GET request to each target’s metrics URL periodically. Targets typically expose metrics using Prometheus client libraries (e.g., a Python ML service might use the prometheus_client library to run an HTTP server thread for metrics). There are also many exporters – these are helper processes that translate metrics from third-party systems to Prometheus format (e.g., Node Exporter for Linux system metrics, or an exporter for database metrics). Exporters allow Prometheus to monitor systems that weren’t instrumented originally by providing a bridge that presents data on a /metrics endpoint. In a government context, for instance, an agency’s legacy Oracle database could be monitored by running an Oracle exporter which Prometheus then scrapes, instead of modifying the database itself.

Prometheus’s configuration file defines scrape jobs, which group targets and set parameters such as the scrape interval and any relabelling rules. The built-in default scrape interval is 1 minute, although many deployments (including the example configuration shipped with Prometheus) set it to 15s. You may set some jobs (such as very expensive metric endpoints or less critical metrics) to a longer interval to reduce load. Conversely, for high-resolution needs, you might scrape certain metrics every 5s or 10s. Prometheus manages the concurrency of scrapes itself; scraping is typically I/O bound and rarely overloads well-behaved targets. If a target is slow to respond or exposes thousands of metrics, you might tune the scrape_timeout (default 10s, and never longer than the job’s scrape interval), adjust the target to expose fewer metrics, or use a separate Prometheus for those heavy metrics.
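The relationship between global defaults and per-job overrides can be sketched as follows. The dictionary mirrors the general shape of a Prometheus configuration file (global, scrape_configs, job_name, scrape_interval, scrape_timeout, static_configs), but the job names and targets are hypothetical, and the resolution logic is a simplified illustration rather than Prometheus's own loader.

```python
import re

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}

def parse_duration(text):
    """Parse simple durations such as '15s' or '1m' into seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]

def effective_settings(config):
    """Resolve (interval, timeout) in seconds per job, falling back to global values."""
    defaults = config.get("global", {})
    resolved = {}
    for job in config["scrape_configs"]:
        interval = parse_duration(job.get("scrape_interval", defaults.get("scrape_interval", "1m")))
        timeout = parse_duration(job.get("scrape_timeout", defaults.get("scrape_timeout", "10s")))
        if timeout > interval:
            raise ValueError(f"job {job['job_name']}: timeout exceeds interval")
        resolved[job["job_name"]] = (interval, timeout)
    return resolved

config = {
    "global": {"scrape_interval": "15s", "scrape_timeout": "10s"},
    "scrape_configs": [
        # High-resolution job: the timeout is lowered to fit the 5s interval.
        {"job_name": "model-serving", "scrape_interval": "5s", "scrape_timeout": "4s",
         "static_configs": [{"targets": ["model-a:8000"]}]},
        # Expensive endpoint: scraped less often, inherits the global timeout.
        {"job_name": "batch-features", "scrape_interval": "1m",
         "static_configs": [{"targets": ["features:9100"]}]},
    ],
}
settings = effective_settings(config)
```

The check on timeout versus interval reflects a practical point: when you shorten a job's interval below the timeout it inherits, the timeout has to be shortened as well.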

Once scraped, samples go into the TSDB (time-series database) with timestamps. If a scrape fails, Prometheus records the target as down (its synthetic up metric becomes 0) and marks the target’s series as stale, so queries stop returning outdated values. There’s also a concept of federation, where one Prometheus server can scrape metrics from another Prometheus – useful to create hierarchical architectures (e.g., each data centre has a Prometheus, and a global Prometheus scrapes aggregated data from them).

In summary, Prometheus’s service discovery and scraping are what make it so convenient in cloud and container environments: you don’t have to babysit the target list. ML engineers deploying models on Kubernetes or VMs can rely on labels or tags to ensure Prometheus picks up their services. For example, you might label your Kubernetes Deployment with metrics=true and configure Prometheus to scrape all pods with that label. This kind of automation is in use at places like GDS (UK), which built a multi-tenant Prometheus service on Cloud Foundry where a service broker automatically told Prometheus which apps to scrape. The net effect is a low-maintenance metrics collection system: as you spin up new ML model instances or services, they show up in your monitoring system without manual edits. Of course, one must also consider security, ensuring that only authorised Prometheus servers can scrape certain endpoints; Prometheus supports TLS and basic authentication for this. But by and large, service discovery plus pull scraping is a proven pattern that scales well in microservice architectures.

Limitations

Despite its strengths, Prometheus also has limitations that you, as the user, need to be aware of. As a single-node database, Prometheus does not natively cluster or replicate data – each instance is standalone and not horizontally scalable beyond certain throughput (though you can run multiple instances for sharding or high availability). This means if one Prometheus server fails, its data is not automatically replicated elsewhere (backups or federated setups are needed for durability). Prometheus prioritises timely metrics collection over absolute precision, so it may not be suitable for use cases requiring 100% accuracy or auditability (e.g. precise billing) – minor gaps or approximations can occur and the data is optimised for trend analysis rather than accounting. It’s not designed for long-term storage of years of data out-of-the-box (the default retention is 15 days), so external storage or downsampling strategies are needed for very long retention. Prometheus also focuses on numeric time-series metrics; it isn’t a log or tracing system, so you’d complement it with a log store such as Elasticsearch and a dedicated tracing system. In summary, Prometheus works brilliantly for monitoring dynamic systems and generating alerts on metrics trends, but it trades off some precision and durability features common in centralised enterprise monitoring tools.

Fundamentals of Grafana

Grafana is a popular open-source visualisation and dashboard platform that is often paired with Prometheus (and many other data sources). While Prometheus stores and provides the metrics data, Grafana provides a rich, interactive web UI to create dashboards, charts, and alerts on top of those metrics. Grafana is data-source agnostic, meaning it can connect to a variety of backends: Prometheus, Graphite, InfluxDB, Elasticsearch, AWS CloudWatch, Azure Monitor, SQL databases, and more. This is very useful in practice because an MLOps stack might have metrics in Prometheus, logs in Elasticsearch, and perhaps business data in a SQL DB – Grafana can unify these into one dashboard view.

Let’s go through Grafana’s key features, how its dashboards work, and features like panels, variables, alerting, its approach to multi-tenancy and access control, and limitations.

Key Features

Grafana’s core strength lies in visualising time-series data through customisable dashboards. Some key features of Grafana include its extensive library of visualisation panel types (graphs, single-value gauges, tables, heatmaps, bar gauges, geo maps, etc.), its support for interactive filtering and time-range controls, and a plugin ecosystem that extends it with even more panel types and data source integrations. Grafana dashboards are interactive – users can hover to see exact values, zoom in on time ranges, apply filters via variables, etc. Grafana also provides annotations (marking events on graphs) and the ability to mix different data sources on the same dashboard.

One standout feature is Grafana’s data source agnosticism and flexibility: it doesn’t store data itself (except some state about dashboards); rather it queries data on the fly from the connected sources. This means you can visualise metrics from Prometheus alongside, say, an annotation fetched from a PostgreSQL table (e.g., showing when a model was retrained) on the same chart. Grafana’s interface to each data source is through a plugin or native integration that converts the Grafana query editor inputs into a query that the backend understands (PromQL for Prometheus, SQL for databases, etc.). Grafana supports PromQL natively for Prometheus sources, making it easy for engineers to craft queries in the dashboard UI.

Grafana also provides unified alerting (in Grafana 8 and later): you can define alert rules in Grafana itself, which are evaluated either by Grafana or by the data source’s capabilities. For Prometheus data, Grafana essentially evaluates PromQL on a schedule to trigger alerts. Grafana’s alerting allows setting up multi-channel notifications (email, Slack, PagerDuty, etc.) through its Alertmanager-style system, and Grafana Cloud offers an OnCall integration to manage rotations.

Grafana’s key strengths have been summarised as: rich visualisations, interactive dashboards with variables, an extensive plugin ecosystem, and enterprise-ready features like granular permissions. It’s also popular because of its ease of use – a user with no coding background can use Grafana’s web UI to click together a dashboard, select metrics, and apply filters, which makes metrics accessible beyond just the Data Scientist or ML engineer (e.g., a product manager could view a Grafana dashboard of key ML service KPIs without needing to run queries themselves).

Data-Source Agnostic Dashboards

Grafana’s dashboards are data-source agnostic and reusable. This means a single dashboard JSON definition can be pointed at different environments or data sources easily. For example, you can create a dashboard for “Model Service Metrics” and have a variable for the data source, allowing you to switch between a Prometheus in dev, staging, or production clusters using the same dashboard template. Grafana supports data source variables – a special kind of template variable that lets you pick a data source on the fly. This is extremely useful if, say, each government agency or each department has its own Prometheus – you could build one dashboard and then simply switch the data source drop-down to view Department A vs Department B metrics (assuming they have similar metrics exposed). Similarly, environment toggles (prod vs test) can be implemented either as separate data sources or via label filters.

Because Grafana is not tied to Prometheus alone, you can also combine data: a single dashboard could show panels from Prometheus (like system metrics) and panels from, say, Elasticsearch (like count of error logs) side by side. Grafana’s plugin ecosystem includes official support for cloud monitoring (CloudWatch, Azure Monitor, Google Cloud Monitoring), databases (Postgres, MySQL), and even CSV files or JSON API endpoints (via community plugins). This polyglot ability means your ML team can have one observability portal for all metrics and metadata. For instance, you might plot model accuracy from a Prometheus metric and plot the number of support tickets from a SQL database to see if there’s a correlation – Grafana enables that unified view.

Grafana dashboards are defined in JSON, which can be exported and imported easily. This means you can version-control dashboards and templatise them. This saves time and ensures some consistency. Agencies can share Grafana dashboards for things like COVID-19 stats or website analytics publicly, demonstrating how a single JSON can be reused across instances.

To keep dashboards data-source agnostic, one should avoid hard-coding anything that’s environment-specific. Instead, use Grafana templating (variables) for dynamic segments (more on that next) and use relative time (like “now-24h” as a range) to keep it broadly applicable. Dashboards can also be shared externally as a snapshot, which strips queries and other sensitive configuration and keeps only the data visible in the panels at the time of capture. This is useful if an agency team wants to publish a dashboard externally for transparency. Grafana supports one-click snapshot sharing (either hosted on Grafana’s free service or self-hosted). This has been used for things like publicly sharing COVID metrics dashboards or city transport dashboards, where the snapshot is static and doesn’t require giving direct access to the live Grafana.

Overall, Grafana’s agnostic approach means it’s a one-stop interface for different telemetry – very valuable in MLOps where you have metrics from the model, from the infrastructure, and maybe from user behaviour all needing correlation. Grafana shines in providing a single dashboard that can pull from all relevant sources.

Panels, Variables, Templating, and Alerting Rules

A Grafana dashboard is composed of one or more panels. Each panel is a visualisation (graph, chart, table, etc.) with a query behind it. For example, one panel might show a time-series graph of “GPU usage”, another might show a single statistic of “Real-time model accuracy”, another might show a table of “Top N features by importance” for model interpretability. Grafana’s UI also allows arranging panels in a grid. You can set per-panel settings like thresholds (to colour values as green/yellow/red), units (so that 0.85 becomes “85%” if you set it as a percentage unit), and transformations (like converting seconds to hours).

Variables (also called template variables) are a powerful feature that make dashboards dynamic and reusable. You can define a variable (e.g., $model) that is filled by a query – for Prometheus, this could be something like label_values(model_name) to get all distinct model names from the metric data. Then in your panel queries, you use $model as a placeholder. The dashboard will show a drop-down allowing the user to select a model name, and all panels will update to filter to that model. Variables can also be chained (one variable’s selection can filter the options of another). For example, you could have $region and $model where $model query depends on the chosen region. This templating is crucial for avoiding “dashboard sprawl” – you don’t need a separate dashboard for each model or each server; one dashboard with variables can cover all, just by selection.
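The placeholder mechanism can be illustrated with a few lines of Python that substitute $name tokens in a query string. This is a sketch of the idea only, not Grafana's interpolation engine: it handles just the plain $name form and renders multi-value selections as a regex alternation suitable for a =~ matcher. The query, the metric predictions_total, and the variable values are hypothetical.

```python
import re

def interpolate(query, variables):
    """Replace $name placeholders with the selected value(s)."""
    def substitute(match):
        value = variables[match.group(1)]
        if isinstance(value, (list, tuple)):
            # Multi-value selection becomes a regex alternation for =~ matchers.
            return "(" + "|".join(value) + ")"
        return value
    return re.sub(r"\$(\w+)", substitute, query)

template = 'rate(predictions_total{model=~"$model", region="$region"}[5m])'

single = interpolate(template, {"model": "churn", "region": "eu"})
multi = interpolate(template, {"model": ["churn", "fraud"], "region": "eu"})
```

One panel definition therefore serves every model: changing the drop-down selection only changes the values substituted into the same query text.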

Grafana supports variables for many use cases: data source selection, dynamic intervals, custom key/value mappings, etc. In an ML monitoring scenario, you might have variables for model version, data centre, client application, etc. So a single “Model Performance Dashboard” can quickly pivot between different models or versions. This is something the AWS Grafana guide calls out as best practice: use template variables instead of copying dashboards for each instance. It not only reduces maintenance but also allows quick comparisons (some Grafana panels even support multi-value variables so you can overlay multiple selections on one graph).

Alerting rules in Grafana allow you to turn any panel query (or a custom query) into a continuous check that can trigger notifications. In newer versions of Grafana, alerting was unified so that you define an alert rule with one or more queries and conditions (e.g., query A = some metric, then condition “WHEN query A is above 0.9 for 5 minutes”). Grafana will evaluate these on a schedule using its built-in scheduler or via an Alertmanager. Grafana’s alerting supports sending alerts to contact points (email, Slack, Teams, etc.) and you can group and silence alerts much as you would with the Prometheus Alertmanager. One advantage: Grafana can alert on any data source, not just Prometheus. So you could have an alert if, say, a value in a SQL database crosses a threshold, something Prometheus by itself wouldn’t do.
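The "above 0.9 for 5 minutes" condition implies a small state machine: a breach first makes the rule pending, and it only fires once the breach has lasted the full duration. The sketch below illustrates that logic on made-up samples evaluated once a minute. The state names follow Prometheus terminology (inactive, pending, firing); Grafana labels the equivalent stages differently. This is an illustration, not either tool's actual evaluator.

```python
def evaluate(samples, threshold, for_seconds):
    """Return the alert state after each evaluation.

    samples: list of (timestamp_s, value), oldest first.
    """
    states = []
    breach_started = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_started is None:
                breach_started = timestamp
            held_for = timestamp - breach_started
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            breach_started = None  # any recovery resets the timer
            states.append("inactive")
    return states

# Hypothetical metric evaluated once a minute.
values = [0.5, 0.95, 0.95, 0.95, 0.95, 0.95, 0.95, 0.5]
samples = [(i * 60, v) for i, v in enumerate(values)]
states = evaluate(samples, threshold=0.9, for_seconds=300)
```

The duration guards against flapping: a single noisy evaluation moves the rule to pending but never sends a notification.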

However, for metrics that are already in Prometheus, many teams keep the alert rules in Prometheus itself and route notifications through Alertmanager, as that ecosystem is quite mature. But if your agency prefers a single interface, Grafana’s alerting can consolidate alert management for metrics, logs, etc., in one UI. One thing to be mindful of: avoid defining the same alert in both systems, as duplicate notifications confuse responders.

In an ML context, consider an alert for data drift: you might have a PromQL query computing a distribution difference. Grafana could continuously run this and alert the ML engineers if drift exceeds a threshold, including perhaps a static image of the panel in the alert (Grafana can attach an image of the panel showing the metric, if image rendering is set up). This is very useful for quick triage – e.g., an email alert about “Model drift high for Model X” could show a mini graph.

Grafana panels also allow drill-down links and navigation. You can add links on dashboards to other dashboards or external URLs. For instance, clicking on a specific model’s panel might navigate to a detailed dashboard for that model. This helps create a logical structure – maybe a top-level overview dashboard with high-level metrics for all models, and then links to per-model deep dives.

Finally, Grafana supports report generation in its Enterprise version – you can schedule periodic PDF exports of dashboards to be emailed. For example, a weekly report to stakeholders about model performance can be automated. The image rendering plugin (or remote rendering service) is required to generate such PDFs or images. In open source Grafana, you can still manually export a dashboard as PDF or PNG by using the print or snapshot features, but scheduled reporting is enterprise-only. Some public sector teams use this to email the senior management team a PDF of key metrics each morning.

In summary, Grafana’s panels and templating let you present metrics in a digestible way for different audiences, while its alerting and reporting features help ensure people are notified when something goes off track in the ML system. It turns raw Prometheus data into actionable insights – combining graphs, legends, annotation text, and alerts.

Role-Based Access Control (RBAC) and Multi-Tenancy for ML Teams

In multi-team or multi-tenant environments (which are common in large organisations and government agencies where many teams might share a Grafana), controlling who can see or edit what is important. Grafana addresses this through organisations, teams, roles, and (in Enterprise) fine-grained RBAC.

Grafana Organisations: In open source Grafana, you can have multiple orgs within one Grafana instance. Each org is a completely isolated space with its own dashboards, data sources, and users. For example, one could set up an org for “Dept of Transportation” and another for “Dept of Health” on a shared Grafana, and users in one org cannot see anything in the other. Organisations provide hard multi-tenancy, but it can become a bit unwieldy to manage many orgs. Users can belong to multiple orgs if needed (with potentially different roles in each).

Roles: Grafana’s basic roles are Viewer, Editor, and Admin. A Viewer can see dashboards but not save changes. An Editor can create and modify dashboards. An Admin can also manage data sources, users, and org settings.

Grafana Enterprise (and Grafana Cloud) add fine-grained RBAC: you can define custom roles with specific permissions (like a role that can edit dashboards but not create data sources, etc.), and you can also assign permissions at the folder or dashboard level. Granular RBAC also supports restricting data sources (so one team cannot even query another team’s Prometheus). This is crucial when one Grafana instance serves multiple unrelated tenants (like a managed service provider scenario). For internal use, many teams get by with just separate folders and using naming conventions to avoid confusion.

Teams: Grafana allows grouping users into teams for easier permission management. For example, a “Data Science Team” could be given Editor access to certain folders. Teams also tie into notification channels for alerting (you can set alerts to notify a Grafana team which then has multiple members’ contacts).

Multi-tenancy models: Aside from Grafana’s built-in org separation, another approach if using Prometheus is to actually shard the data: e.g., each ML team runs their own Prometheus and maybe their own Grafana. This is simpler from a security viewpoint (complete isolation) but then you don’t get cross-team visibility and you maintain multiple instances. Grafana’s docs mention that using multiple orgs is easier/cheaper than multiple full instances, since you share the server. It depends on trust and use case.

For ML teams in a large org or government, one might use one Grafana with RBAC such that each team can only edit their own dashboards but maybe view common operational dashboards. Grafana supports setting folder permissions to achieve this (e.g., only Team X has access to Folder X).
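
As a rough sketch of the folder-level setup just described, the Python below builds the kind of JSON body that Grafana's folder permissions HTTP API accepts. The team IDs, the helper name and the folder UID placeholder are hypothetical, and the numeric permission levels (1 = View, 2 = Edit, 4 = Admin) are an assumption that should be verified against the documentation for the Grafana version in use.

```python
import json

# Permission levels assumed for Grafana's folder permissions API
# (verify against your Grafana version before relying on them).
VIEW, EDIT, ADMIN = 1, 2, 4


def folder_permissions(owner_team_id, viewer_team_ids):
    """Build a permissions body: one team may edit, the others may only view."""
    items = [{"teamId": owner_team_id, "permission": EDIT}]
    for team_id in viewer_team_ids:
        if team_id != owner_team_id:
            items.append({"teamId": team_id, "permission": VIEW})
    return {"items": items}


# Hypothetical team IDs: team 7 owns the folder, teams 8 and 9 may only view.
payload = folder_permissions(owner_team_id=7, viewer_team_ids=[8, 9])
print(json.dumps(payload, indent=2))
# An admin would send this body to the folder's permissions endpoint,
# e.g. POST /api/folders/<folder-uid>/permissions.
```

Such a request typically replaces the folder's whole permission list rather than appending to it, so the body should contain every entry that is meant to remain.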

Within a team, you also want to manage changes to dashboards carefully (especially in production monitoring). Best practice is to use version control for dashboards – Grafana allows exporting JSON models and you can store them in Git. Grafana Enterprise has a concept of Dashboard Git sync, or you can use the provisioning feature to manage dashboards as code. This way, changes can be reviewed. Otherwise, there’s the risk of someone accidentally deleting a dashboard or altering it in a way that hides an important metric.

In government agencies, compliance requirements often demand strict access control. Audit logs in Enterprise Grafana can track who viewed what (if needed). For agencies dealing with sensitive ML data (e.g., personal health information dashboards), one would typically isolate those in separate orgs or even separate Grafana deployments.

To sum up, Grafana provides flexible mechanisms to support multi-user environments:

Use folders and teams to prevent dashboard sprawl and ensure users focus on relevant content.
Use orgs or separate instances if data needs hard isolation.
Leverage RBAC (in Enterprise) for fine control, e.g., allow an intern to view but not edit certain dashboards.
Because Grafana is often a window into critical operational data, applying the principle of least privilege is wise (only grant editing rights to those who truly need it).

Limitations

In terms of limitations, Grafana does not handle data retention or heavy computations itself – it relies on the data source (Prometheus, etc.) for performance. This means if you have extremely large volumes of data, Grafana might be slow to draw graphs unless you’ve tuned the data source or used features like query caching. Grafana Enterprise and Cloud offer a query caching feature that can store query results for a short time to accelerate dashboards where the same query is repeated frequently. This can significantly reduce load on Prometheus for very expensive or large queries, at the cost of slight staleness. However, not all queries benefit: Prometheus already serves recent samples from memory, and for operational dashboards fresh data is often preferred over a cached result.
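
The trade-off between reduced backend load and staleness can be illustrated with a small time-to-live cache. This is a toy model of the idea only, not Grafana's implementation; the class name, the simulated clock and the query string are all hypothetical.

```python
import time


class TTLQueryCache:
    """Cache query results for ttl_seconds (toy model, not Grafana's code)."""

    def __init__(self, run_query, ttl_seconds, clock=time.monotonic):
        self._run_query = run_query
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # query string -> (stored_at, result)
        self.backend_calls = 0

    def get(self, query):
        now = self._clock()
        hit = self._entries.get(query)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # served from cache: may be up to ttl_seconds stale
        self.backend_calls += 1
        result = self._run_query(query)
        self._entries[query] = (now, result)
        return result


# Simulated clock and backend so the example runs without a Prometheus server.
fake_now = [0.0]
cache = TTLQueryCache(
    run_query=lambda q: f"result at t={fake_now[0]}",
    ttl_seconds=60,
    clock=lambda: fake_now[0],
)

query = "sum(rate(model_predictions_total[5m]))"
first = cache.get(query)
fake_now[0] = 30.0  # within the TTL: the cached result is returned
second = cache.get(query)
fake_now[0] = 90.0  # TTL expired: the backend is queried again
third = cache.get(query)
print(first, second, third, cache.backend_calls)
```

Three dashboard refreshes cost two backend queries here, and the second refresh shows data that is 30 seconds old; a longer TTL saves more load at the price of more staleness.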

Another limitation is alerting duplication: if using both Prometheus Alertmanager and Grafana’s alerting, you might end up managing two systems. Many organisations stick to one to avoid confusion (though Grafana can actually hook into Prometheus Alertmanager as well). For MLOps, where custom alert logic may be needed (like detecting data drift), sometimes writing that alert in Python or using a specialised tool might be considered; but increasingly one can encode a lot of logic in PromQL and thus use Grafana or Prometheus alerting systems.

In summary, Grafana is a powerful visualisation layer but it assumes your underlying data is accessible and reasonably aggregated. It doesn’t, for example, do heavy-duty analytics joins between disparate data sources. Grafana Labs also maintains Loki (for logs), Mimir (for metrics at scale) and Tempo (for tracing); these are separate backend components rather than features of Grafana itself, but the Grafana UI brings them together. For everyday use, Grafana’s limitations are few: as long as you don’t try to use it as a replacement for things like Tableau for large static business reports, and you keep in mind it’s optimised for time-series and operational dashboards, it’s extremely versatile.

How Prometheus and Grafana Come Together to Achieve MLOps Monitoring and System Observability

Prometheus and Grafana are often deployed as a pair, creating a complete monitoring solution: Prometheus scrapes and stores metrics, and Grafana queries those metrics to visualise and alert on them. In the context of MLOps, this combination provides end-to-end visibility into both the infrastructure running ML workloads and the ML-specific metrics (like model performance, data quality metrics, etc.).

Prometheus + Grafana workflow: You instrument your ML systems (model serving, data pipeline, training jobs, etc.) with Prometheus client libraries to expose metrics, and Prometheus scrapes those endpoints at a regular interval. Grafana is configured with Prometheus as a data source. Then you build Grafana dashboards to observe things like: model request rates, latency percentiles, throughput of data ingestion, number of predictions made, accuracy or error rates, etc. You also set up alerts – some in Prometheus (which sends to Alertmanager then maybe to a notification system) and/or in Grafana directly – for conditions like “model error rate above X” or “no predictions in last 10 minutes” (indicative of a stalled service).
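
The two example alert conditions can be written as PromQL expressions. The sketch below only assembles the query strings; the counter names model_errors_total and model_predictions_total are hypothetical and would need to match whatever the serving code actually exposes.

```python
def error_rate_expr(threshold, window="5m"):
    """PromQL: fraction of failed predictions over the window exceeds threshold."""
    errors = f"sum(rate(model_errors_total[{window}]))"
    total = f"sum(rate(model_predictions_total[{window}]))"
    return f"{errors} / {total} > {threshold}"


def stalled_expr(window="10m"):
    """PromQL: no predictions were counted during the window."""
    return f"sum(increase(model_predictions_total[{window}])) == 0"


print(error_rate_expr(0.05))
print(stalled_expr())
```

Either string could be used as the expression of a Prometheus alerting rule or a Grafana alert query. If the service stops being scraped altogether, its series go stale and the second expression returns nothing, so it is usually paired with a rule based on absent(model_predictions_total).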

Paired with log and trace backends, Grafana and Prometheus provide observability across the three pillars for ML systems:

Metrics – capturing numeric signals of system behaviour, collected by Prometheus.
Logs – for detailed error context, accessible through Grafana from a log data source such as Loki.
Traces – for end-to-end request flows, viewable in Grafana from a tracing data source such as Tempo.

Prometheus and Grafana together essentially serve as the nervous system and control centre of an MLOps pipeline. They not only help detect issues (like data pipeline stalls, model degradations, service downtime) but also provide the historical context to analyse them. For example, if a model’s performance dips every weekend, Grafana’s graphs will reveal that pattern, prompting investigation into a possible weekly data drift or user behaviour change.

In practice, using Prometheus and Grafana together leads to rapid feedback loops. This tight monitoring loop is fundamental to continuous delivery in ML systems, where data or model issues can be subtle but metrics will surface them (e.g., a slight increase in inference time or memory usage might hint at an inefficient model).

In sum, Prometheus and Grafana form a synergistic pair: Prometheus provides the data and alerting engine, Grafana provides the visualisation and user interface. Together they are greater than the sum of parts, enabling not just monitoring but true observability – the ability to ask arbitrary questions about your system's behaviour (via PromQL queries in Grafana) and get answers visually. For any ML system in production – whether it’s a predictive model for public policy, a recommendation system for a citizen portal, or a computer vision pipeline for traffic management – this duo is often the go-to solution for keeping an eye on things and ensuring the system meets its service level objectives.

Best Practices and Common Pitfalls when using Prometheus + Grafana

To ensure a successful and maintainable monitoring setup, it’s helpful to follow best practices and avoid known pitfalls. This section goes over some of the key dos and don’ts, particularly focusing on issues like over-instrumentation (too many or too granular metrics), separating experimental vs. production monitoring, managing dashboard sprawl, and involving humans appropriately for complex ML alerts (like drift or bias detection).

Avoid Over-Instrumentation and Metric Cardinality Explosions

One of Prometheus’s few pain points is high-cardinality metrics – metrics with many possible label values. Every unique combination of label values is stored as a separate time-series, so when the number of combinations grows out of control the result is a “cardinality explosion”, which can lead to Prometheus using huge amounts of RAM or even crashing (because it tries to keep track of all these series). A well-known guideline from the Prometheus developers is to avoid labels that can have more than 10 (or at most low hundreds) of unique values unless absolutely needed. They often say, “imagine each unique label combination is like a new time-series in memory – budget accordingly.”

Best practices to avoid this pitfall:

Don’t label metrics with IDs or timestamps. For instance, do not do request_duration_seconds{request_id="XYZ123"} – that’s essentially a unique series per request, defeating the point of metrics (that’s what logs or traces are for). Instead, expose an aggregate metric like latency distribution.
Limit dimensionality. If you have too many label dimensions, the combinations can explode. For example, a metric with 5 labels, each having 10 values, can produce up to 10^5 = 100,000 series if the labels vary independently. Pick the important ones (maybe 2-3 dimensions).
Use hierarchical labels wisely. E.g., labelling by cluster and instance is normal, but labelling by instance and pod_name and container_id and IP might be redundant.
Aggregate high-card events externally. If you need something like per-user stats, consider pushing those to a database or use logs. Or aggregate into buckets. For example, instead of labelling each product_id in a metric, maybe label by product category. Or maintain a top-N products metric.
Monitor your cardinality. Prometheus reports its total number of active series (the prometheus_tsdb_head_series metric), and the TSDB status page in its web UI lists the metric names and labels with the most series. If one metric is skyrocketing, investigate if label usage is too granular.
Functional sharding of metrics. If one service legitimately has high cardinality needs (e.g., maybe an ML model with thousands of output labels, each with a metric), consider isolating that to its own Prom or summarising it. Or use a histogram to encode many values in one metric rather than separate ones.
Drop unnecessary labels at scrape time. Prom’s scrape config can drop labels via relabelling (metric_relabel_configs with the labeldrop action).
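
The arithmetic behind the dimensionality guideline above is easy to check. The sketch below computes the worst-case number of series for a single metric as the product of the per-label value counts; the label names are hypothetical.

```python
from math import prod


def worst_case_series(label_cardinalities):
    """Upper bound on series for one metric: product of per-label value counts."""
    return prod(label_cardinalities.values())


# Hypothetical labels on a single metric, each with ten possible values.
labels = {"model": 10, "version": 10, "region": 10, "segment": 10, "status": 10}
print(worst_case_series(labels))  # 100000

# Keeping only the labels that matter shrinks the bound sharply.
trimmed = {name: labels[name] for name in ("model", "version", "status")}
print(worst_case_series(trimmed))  # 1000
```

For a histogram the bound is multiplied again by the number of buckets, since each bucket is stored as its own series.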

Over-instrumentation isn’t just cardinality, it can also mean measuring too many things. It’s tempting for developers to add a metric for every possible internal statistic. But more metrics means more overhead and often a lot of them are not actionable. Focus on key metrics that are tied to SLIs/SLOs or known failure modes. In ML monitoring, it might be crucial to track overall prediction count and error rate, but maybe you don’t need a metric for every internal layer of the neural network’s timing (unless debugging specific issues). If you do want super fine metrics for debugging, consider gating them (e.g., enable them only in dev environment or have them off by default).

A common pattern to avoid: labelling metrics with free-form text (like error message). That will create a new series for each distinct text, which results in cardinality hell. Instead, use a label for error type and leave details to logs.
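
One way to keep an error label bounded is to map every exception onto a small fixed set of types before it reaches the metric. The following is a minimal sketch; the category names and the exception-to-category mapping are illustrative assumptions, not a standard.

```python
# A fixed, small set of values that is safe to use as a label.
ERROR_TYPES = ("timeout", "invalid_input", "model_failure", "other")


def error_type(exc):
    """Map an exception to a bounded label value; the message belongs in logs."""
    if isinstance(exc, TimeoutError):
        return "timeout"
    if isinstance(exc, (ValueError, KeyError, TypeError)):
        return "invalid_input"
    if isinstance(exc, RuntimeError):
        return "model_failure"
    return "other"


errors = [
    ValueError("feature 'age' = -3 out of range"),
    ValueError("feature 'age' = -7 out of range"),
    TimeoutError("upstream feature store took 5.2s"),
]
counts = {}
for exc in errors:
    label = error_type(exc)
    counts[label] = counts.get(label, 0) + 1
print(counts)  # {'invalid_input': 2, 'timeout': 1}
```

Three distinct messages collapse into two label values, so the number of series stays fixed no matter how varied the error text becomes.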

Symptoms of cardinality problems: Prometheus memory usage spikes unexpectedly, slow queries (especially ones using regex on a high-card label), long startup times, or some metrics not being ingested (if hitting series limits). If that happens, identify and cut down those metrics.

In summary, be intentional with metrics. More isn’t always better – the signal-to-noise ratio matters. And high cardinality is the arch-enemy of Prometheus scalability. By following naming conventions and reviewing metrics design (Prom’s documentation on instrumentation provides guidance), you can avoid the cardinality trap and keep the monitoring system healthy.

Separate Experimentation vs. Production Monitoring Stacks

In MLOps, there is often a clear distinction between production systems (live models serving real users or critical decisions) and experimental or development systems (model training experiments, offline evaluation, etc.). It is a best practice to separate the monitoring of these environments to some degree, so that noisy experimental metrics don’t interfere with or confuse the production monitoring, and likewise so that issues in dev don’t page production on-call engineers.

Why separate?

Stability: Production monitoring needs high reliability and signal-to-noise. If every time a data scientist runs a new training job with dozens of metrics, those feed into the same Prometheus/Grafana as prod, it could flood it with temporary series (violating the above cardinality rule) or trigger alerts that are irrelevant to end-user service quality.
Security/Access: Production data might be sensitive; you might not want every developer to see all prod dashboards. Conversely, dev environment might have less stringent access control – mixing them could create loopholes.
Performance: As mentioned, experiments can produce lots of metrics. For example, an AutoML run might log metrics for each parameter combination – which you wouldn’t want cluttering prod Prometheus.
Logical clarity: It’s easier to reason about production metrics when they are not cluttered with dev services. If an alert fires, you know it’s prod. If experiments crash (which they do often), you don’t want your prod incident response team wading through those.

How to separate:

Use different Prometheus instances or servers for prod vs non-prod. Many teams run one Prometheus for production cluster(s) and a separate one for staging/dev cluster. They might configure Grafana with multiple data sources (Prod Prom, Dev Prom). This way, production dashboards point to Prod Prom data only. Dev Prom can be less beefy and allowed to occasionally fail without big consequence.
Use naming or labelling conventions to distinguish environment. For instance, include a label env="prod" or env="dev" on all metrics (Prometheus can attach such a label to every target of a scrape job in its configuration). Then, ensure production alerts/dashboards filter on env="prod". This is a safety net if you do use one shared Prom. But usually separate is better.
Dedicated Grafana or org for prod. In some cases, you might even have a separate Grafana instance (or at least separate organisation in Grafana) for production monitoring, to tightly control access and reduce clutter. Dev Grafana might have more experimental dashboards, while Prod Grafana is curated.
Alert routing: If by chance some dev metrics are in the same alerting system, ensure alerts from dev either don’t page or go to a different receiver.
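
The routing rule in the last point can be modelled as a small decision function. In practice this logic would live in Alertmanager's routing configuration or in Grafana's notification policies rather than in application code; the receiver names and the severity label below are hypothetical.

```python
def pick_receiver(alert_labels):
    """Choose a notification target from an alert's labels."""
    env = alert_labels.get("env", "unknown")
    severity = alert_labels.get("severity", "warning")
    if env == "prod" and severity == "critical":
        return "oncall-pager"
    if env == "prod":
        return "prod-team-chat"
    # dev, staging, or unlabelled alerts never page anyone
    return "dev-team-chat"


print(pick_receiver({"env": "prod", "severity": "critical"}))
print(pick_receiver({"env": "dev", "severity": "critical"}))
```

Because unlabelled alerts fall through to the non-paging receiver in this sketch, it only works safely if every production target reliably carries the env label.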

The idea of separation is akin to how software moves through environments: you wouldn’t let a test environment failure set off the same alarms as production. Apply that concept to ML pipelines and metrics too.

For example, an ML team might have a “model training cluster” where they use Prometheus to track GPU usage, training loss, etc., but they intentionally keep that Prometheus separate from the production inference Prometheus. If the training cluster’s Prometheus crashes due to heavy load, it doesn’t impact production monitoring. Also, the training cluster may have very different metrics (maybe per-experiment metrics, which they’ll analyse after the fact rather than needing real-time alerting).

From a Grafana perspective, you could have a unified Grafana that simply has two sets of dashboards – some with data source = Prod Prom, others = Dev Prom. That’s fine if properly labelled. Alternatively, run two Grafana instances – one locked down for Prod (with alerting configured to call on-call) and one for Dev (alerting maybe off or just notifications to the team’s Slack).

Another angle: Feature experimentation vs model monitoring. If you do a lot of A/B tests or shadow deployments, you may treat those model instances as production (if they serve real traffic, even if not all users). But if you have an experimentation environment where you try new models on test data, definitely separate.

Some teams incorporate dev vs prod context as part of dashboard templating – e.g., a variable to select environment. That works if you trust everyone to select the right one, but for critical monitors, better to have separate fixed dashboards to avoid confusion.

Pitfall to avoid: mixing metrics that have wildly different scales or behaviours without clarity. For instance, if the dev environment has regularly failing jobs (which is normal during experimentation) and those emit error metrics, you don’t want those to raise the global “system errors” alert. One approach is to scope alerts using distinct metric names or an environment label. But simpler: don’t feed dev metrics into production alert evaluation at all.

In summary, isolate your “chaos” (experimentation naturally is chaotic and exploratory) from your “order” (production which needs stable monitoring). This protects the reliability of production monitoring and ensures focus – on-call can trust that if an alert fires, it’s production impacting, not just an experiment. It also gives experimenters freedom to instrument wildly or test new metrics without fear of breaking prod monitoring (with the caveat that if they share any infra, they should still not do cardinality explosions that crash a shared Prom server – another reason to give them a separate Prom).

Keeping Dashboard Sprawl Under Control

“Dashboard sprawl” refers to the proliferation of too many Grafana dashboards, often with overlapping or outdated information. This can happen in any organisation: every time someone needs a new view, they create a new dashboard, and old ones are rarely deleted. Over time you end up with hundreds of dashboards – some named poorly, many duplicates, and it becomes hard to know which dashboard is the canonical source for a given info. This wastes time and can also tax Grafana unnecessarily.

Best practices to avoid and manage sprawl:

Adopt a naming convention and folder structure. For instance, organise by team or system: a folder for “Production – Service A”, one for “Production – Service B”, one for “Infra – Database”, one for “Experimental”, etc. Within folders, name dashboards clearly with what they cover. This makes browsing easier (rather than one giant General folder with 100 dashboards).
Use template variables instead of copies. If you have many similar entities (like microservices, or ML models), try to have one dashboard with a variable for the entity, rather than one per entity. For example, instead of “Model X dashboard” and “Model Y dashboard” that are identical except for queries, make one “ML Model Dashboard” with a variable for model name. This single dashboard can serve both X and Y by selection. That eliminates duplicates and ensures improvements to the dashboard apply to all models.
Regularly prune and archive dashboards. Set a schedule, maybe quarterly, to review dashboards: which haven’t been viewed in months? Which refer to decommissioned systems? Grafana doesn’t have an automatic cleanup, so this is a manual process. However, having metrics for Grafana usage can help find stale ones.
Control who can create dashboards in production. If every engineer can freely create dashboards in the production org, they often will. Consider allowing Editors to create in a “Scratchpad” folder, but have a smaller set of maintainers curate the official dashboards in main folders. Or encourage using Grafana’s snapshots for one-off visualisations.
Documentation of dashboards. Create a README or index dashboard that lists key dashboards and their purpose. This helps newcomers know where to look. It also can be a place to mark obsolete dashboards.
Version control / Infrastructure as Code. If you manage dashboards via code (Grafana provisioning or CI pipeline), sprawl is naturally constrained because changes go through code review. It’s easier to spot duplicates or to ensure removal of old ones as part of code updates.
Prevent needless forks. Sometimes two teams want slightly different views and they fork a dashboard. Try to unify their needs with one dashboard that uses variables or conditionals. Or if one team’s view is a strict subset of another’s, perhaps one dashboard can do both (with collapsible rows or links).
Leverage Dashboard Links and Library Panels. Grafana now has the concept of library panels and you can link dashboards. Using these features can reduce making separate dashboards for things that could just be a click away within one workflow.
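
To make the template-variable practice above concrete, the sketch below generates a trimmed-down dashboard definition with one variable and one panel. It is an illustration under assumptions: the metric and label names are hypothetical, and a real dashboard export contains many more fields (data source, layout, identifiers) than are shown here.

```python
import json


def model_dashboard(metric="model_predictions_total", label="model"):
    """Minimal dashboard JSON with one template variable and one panel."""
    return {
        "title": "ML Model Dashboard",
        "templating": {
            "list": [
                {
                    "name": label,
                    "type": "query",
                    # label_values() lists the values seen for the label
                    "query": f"label_values({metric}, {label})",
                }
            ]
        },
        "panels": [
            {
                "title": "Prediction rate",
                "type": "timeseries",
                "targets": [
                    # $model is replaced by the value chosen in the dropdown
                    {"expr": f'sum(rate({metric}{{{label}="${label}"}}[5m]))'}
                ],
            }
        ],
    }


dashboard = model_dashboard()
print(json.dumps(dashboard, indent=2))
```

A file like this can be kept in Git and loaded through provisioning, so one reviewed definition serves every model instead of a copy per model.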

Why is sprawl particularly relevant in MLOps? ML systems can be complex with many components – data ingestion, feature store, training, serving, etc. It’s easy to make separate dashboards for each component and sub-component. But if overdone, you might have 50+ dashboards and people don’t know where to look when an issue arises (time wasted = longer incidents). Consolidating related metrics into fewer logical dashboards means when an alert hits, you might just have to check one or two dashboards, not hunt through many. Additionally, excessive dashboards can lead to inconsistent monitoring. For example, if one dashboard’s threshold for “high latency” is different from another’s, teams might get conflicting signals. Better to have a single source of truth.

In summary, treat dashboards as code/artifacts that need maintenance. Fewer, well-maintained dashboards are better than many neglected ones. It’s an often-quoted principle that observability is about understanding, not just lots of charts. Each dashboard should have a clear purpose. If two dashboards serve the same purpose, merge them. If one serves no purpose anymore, eliminate it. This ensures that when something goes wrong at 2am, the on-call doesn’t have to guess which of the 10 similar dashboards has the info they need, or worse, not even realise a better dashboard existed.

Human-in-the-Loop for Drift and Bias Alerts

Automated monitoring is fantastic for catching technical issues (latency spikes, error rates, etc.), but when it comes to machine learning-specific issues like model drift or bias detection, a best practice is to keep a human in the loop for evaluation and response. Why? Because detecting drift or bias can be complex and context-dependent – an alert might indicate a potential problem, but deciding if it’s truly a problem and what to do often requires human judgment (at least with current tools).

Drift alerts: Data drift or concept drift occurs when the statistical properties of input data or the relationship to outputs changes over time. You can set up metrics to measure drift (e.g., KL divergence between recent feature distribution and training distribution, or drop in model accuracy on a rolling window). If such a metric crosses a threshold, that can trigger an alert. However, that alert doesn’t necessarily mean immediate emergency like a system outage would. It flags that the model might be becoming stale or less trustworthy. A human (ML engineer or data scientist) should review: maybe the drift is due to a temporary anomaly in data that will revert, or maybe it’s a real shift requiring model retraining.
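
As an illustration of the KL divergence metric mentioned above, the sketch below compares a recent histogram of one feature with its training-time histogram. The bin counts are invented, and the small additive smoothing term is an assumption made here to avoid division by zero in empty bins.

```python
import math


def kl_divergence(recent_counts, reference_counts, epsilon=1e-6):
    """KL(recent || reference) over shared histogram bins, with smoothing."""
    if len(recent_counts) != len(reference_counts):
        raise ValueError("histograms must use the same bins")
    p = [c + epsilon for c in recent_counts]
    q = [c + epsilon for c in reference_counts]
    p_total, q_total = sum(p), sum(q)
    return sum(
        (pi / p_total) * math.log((pi / p_total) / (qi / q_total))
        for pi, qi in zip(p, q)
    )


training = [100, 300, 400, 200]   # reference distribution of one feature
same_shape = [50, 150, 200, 100]  # fewer samples, same proportions
shifted = [400, 300, 200, 100]    # mass moved towards the first bin

print(kl_divergence(same_shape, training))  # very close to 0
print(kl_divergence(shifted, training))     # roughly 0.35
```

The resulting number could be exported as a gauge per feature and alerted on, with the threshold chosen from how the value behaves on known-good data rather than from a universal constant.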

Thus, drift alerts are often configured to notify a team (perhaps via email or a ticket) rather than page at midnight. The alert message should include relevant info (which feature drifted, magnitude, etc.). The human then looks at dashboards, perhaps does additional analysis to confirm drift’s impact.

Bias alerts: Similar to drift, bias metrics (like model performance disparity across demographics) might be monitored. If a metric indicates growing bias (say the model’s error rate for one group is now significantly higher than for others beyond a threshold), you’d alert the ML team. But the resolution might be nuanced: gather more data for that group, retrain, or even decide if the threshold was too strict. A human needs to investigate if it’s a metric artifact or a real bias issue.
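
A disparity metric of the kind described above can be computed from labelled outcomes per group. The following is a minimal sketch with invented data; the group names and the 0.05 review threshold are hypothetical, and error-rate gap is only one of several possible fairness measures.

```python
def error_rates_by_group(records):
    """records: iterable of (group, is_error) pairs -> {group: error rate}."""
    totals, errors = {}, {}
    for group, is_error in records:
        totals[group] = totals.get(group, 0) + 1
        errors[group] = errors.get(group, 0) + (1 if is_error else 0)
    return {group: errors[group] / totals[group] for group in totals}


def max_disparity(rates):
    """Largest gap in error rate between any two groups."""
    return max(rates.values()) - min(rates.values())


# Invented outcomes: group A has 10 errors in 100, group B has 20 in 100.
records = (
    [("A", False)] * 90 + [("A", True)] * 10
    + [("B", False)] * 80 + [("B", True)] * 20
)
rates = error_rates_by_group(records)
gap = max_disparity(rates)
needs_review = gap > 0.05  # flag for human review, not automatic action
print(rates, gap, needs_review)
```

The flag only opens a review; whether the gap reflects a real bias issue, a small sample, or a threshold that is too strict remains a human decision.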

Why human oversight is important:

ML metrics often require interpretation. A spike in CPU has an obvious meaning (overload) and solution (scale up). A spike in KL divergence of input data needs context – which feature? Does it matter? Is the model still performing okay? Only a human can answer that fully.
Avoid automated actions on potentially noisy signals. If you automatically retrained a model on detecting drift, you might retrain on bad data and make things worse.
Ethical considerations: For bias, definitely involve humans. You wouldn't automatically disable a model because a bias metric tripped without human confirmation, because that decision trades service availability against fairness, and such trade-offs need discussion.

How to integrate human-in-the-loop:

Use Grafana to send such alerts to a human-friendly channel like an email with charts, or a Slack message to a team channel.
Provide context in alerts or dashboards. Clear guidance ensures the human knows how to act.
Possibly schedule periodic human reviews of models even if no alert fired. For instance, a monthly report (via Grafana reporting or manual analysis) on model performance and drift, to be reviewed by an ML team.

Therefore, treat ML model quality alerts differently from traditional uptime alerts. Ensure that responsibility is assigned – e.g., the ML engineer on-call during business hours reviews any drift alerts. Some organisations have a separate "Model accuracy on-call" rotation or include it in data scientist responsibilities, not the same ops team that handles infra. That specialisation ensures the person responding has the skills to interpret the alert.

In summary, while Prometheus and Grafana can automate detection of model performance issues, they should augment, not replace, human judgment in these areas. The best practice is to incorporate workflows where humans regularly inspect model metrics (with help from Grafana dashboards and alerts as cues) and make informed decisions on retraining, model updates, or investigation of data issues. This reduces the risk of automated pipelines doing the wrong thing in complex scenarios that require domain understanding. It also prevents alert fatigue: instead of dozens of confusing alerts that no one knows how to handle, a few well-directed alerts go to the right experts, who then intervene.

Concluding Remarks

Prometheus and Grafana have proven to be a powerful combination for monitoring modern machine learning systems, offering transparency and control over both infrastructure and model performance. In this chapter, we explored how Prometheus provides a robust foundation for metrics collection with its flexible data model and query language, and how Grafana builds on that to deliver intuitive visualisations and collaboration through dashboards. For MLOps specifically, Prometheus and Grafana enable teams to not only keep services running (through classic uptime and resource monitoring) but also to ensure the ML models themselves remain healthy and relevant. By tracking metrics for data drift, prediction accuracy, and bias, and bringing those into the same observability platform, organisations and agencies can catch subtle issues in models before they escalate – all while maintaining the reliability of the surrounding application. In conclusion, mastering Prometheus and Grafana in the context of MLOps gives you the capability to not only detect problems but to gain confidence in your ML systems. When deployed thoughtfully, these tools help ensure that your models are delivering value reliably, that you can respond quickly when they don’t, and that you maintain public trust in automated decisions by monitoring for unintended behaviour.

Useful References for Additional Reading: