Chapter 5D: Model Evaluation

Effective model evaluation is critical for validating the performance and reliability of machine learning solutions. By carefully selecting evaluation metrics, you gain insights into how well your model will generalize and whether it meets the project's objectives. In this chapter, we’ll explore key metrics for classification, regression, anomaly detection, and clustering, highlighting the mathematical formulation, pros and cons, and assumptions. We will also provide guidance on baseline heuristics, fairness and bias considerations, and code examples on how to integrate these metrics into an AWS SageMaker pipeline.

5D.1. Classification Metrics

Let’s denote:
- \(TP\) = True Positives, \(TN\) = True Negatives, \(FP\) = False Positives, \(FN\) = False Negatives

Accuracy

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

Measures the proportion of all predictions that the model gets right. It answers the question: "Out of every instance the model evaluated, what fraction did it classify correctly?"

Pros

Easy to interpret.
Useful when class distribution is reasonably balanced.

Cons

Can be misleading if classes are heavily imbalanced (e.g., a 99% negative class).

Assumptions

Typically assumes balanced or near-balanced classes, otherwise it may obscure minority class performance.
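
To make the imbalance caveat concrete, here is a minimal sketch in plain Python. The 99-to-1 dataset and the always-negative "model" are hypothetical and chosen only for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical imbalanced dataset: 99 negatives (0) and 1 positive (1).
y_true = [0] * 99 + [1]

# A trivial "model" that always predicts the majority (negative) class.
y_pred_majority = [0] * 100

# Accuracy looks excellent even though the only positive case is missed.
print(accuracy(y_true, y_pred_majority))  # 0.99
```

The high score here says nothing about the minority class, which is why accuracy is usually reported alongside Precision and Recall on skewed data.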

Precision, Recall, and F1 Score

Precision

\[ \text{Precision} = \frac{TP}{TP + FP} \]

Measures the proportion of predicted positive instances that are actually positive. It answers the question: "When the model predicts a positive, how often is it actually correct?"

Recall (also called Sensitivity or True-Positive Rate)

\[ \text{Recall} = \frac{TP}{TP + FN} \]

Measures the proportion of actual positive cases that a model correctly identifies. It answers the question: "Of all the real positives in the data, how many did the model capture?"

F1 Score

\[ \text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]

Combines Precision and Recall through their harmonic mean, giving a single number that balances the two. It ranges from 0 (worst) to 1 (best) and is high only when both Precision and Recall are high.

Pros

Useful for imbalanced classification.
Capture trade-offs between false positives and false negatives.

Cons

Might not be intuitive to non-technical stakeholders.
F1 loses clarity on whether precision or recall is being prioritized.

Assumptions

F1 weights Precision and Recall equally, which implicitly treats false positives and false negatives as equally costly.
If one error type is more critical, consider focusing on Precision or Recall specifically.
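
The three formulas above can be sketched directly from confusion-matrix counts. The counts in the example are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention, not something the formulas prescribe.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute Precision, Recall and F1 from confusion-matrix counts.

    Assumption: a metric with a zero denominator is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

With these counts Precision is 0.8 and Recall is 0.5; the harmonic mean pulls F1 (about 0.615) towards the weaker of the two.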

ROC AUC (Receiver Operating Characteristic Curve) and PR AUC (Precision-Recall Curve)

ROC AUC

Plots True Positive Rate (Recall) vs. False Positive Rate at various classification thresholds.
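Summarizes the model's discrimination ability in a single number between 0 and 1: 0.5 corresponds to random ranking, 1.0 to perfect separation, and values below 0.5 indicate a ranking worse than chance.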

PR AUC

Plots Precision vs. Recall across thresholds.
Better indicator of performance when data is significantly imbalanced.

Pros

Both of these metrics are threshold-independent.
Provide a holistic view of the trade-off between sensitivity and specificity (ROC) or precision and recall (PR).

Cons

ROC AUC can look overly optimistic on highly imbalanced data, because the false positive rate stays small when negatives vastly outnumber positives.
PR curves can become difficult to interpret if the positive class is extremely rare.

Assumptions

Require continuous scores or probabilities, so that the classification threshold can be varied over a wide range.
More relevant for binary classification tasks.
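
One way to see why ROC AUC is threshold-independent is its rank interpretation: it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half. The sketch below computes it by comparing every positive-negative pair. It is a plain-Python illustration on made-up scores, not an efficient implementation.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative.")

    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0   # positive ranked above negative
            elif sp == sn:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical labels and model scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
print(auc)  # 0.75
```

Three of the four positive-negative pairs are ordered correctly, hence 0.75. No threshold appears anywhere in the computation.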

5D.2. Regression Metrics

Mean Squared Error (MSE) and Root Mean Squared Error (RMSE)

\[ \text{MSE} = \frac{1}{N}\sum_{i=1}^{N} \bigl (\hat{y}_i - y_i\bigr)^2 \]

\[ \text{RMSE} = \sqrt{\text{MSE}} \]

Measures the average squared difference between the predicted values and the true values. In other words, MSE measures how far predictions deviate from actual targets, penalizing larger errors more heavily because the differences are squared. A lower MSE indicates better predictive accuracy for regression models.

Pros

Heavily penalizes large errors.
RMSE is in the same units as the target variable.

Cons

Sensitive to outliers due to squaring.
May not be suitable if large outliers are frequent but tolerable.

Mean Absolute Error (MAE)

\[ \text{MAE} = \frac{1}{N}\sum_{i=1}^{N} \lvert \hat{y}_i - y_i \rvert \]

Measures the average of the absolute differences between predicted values and true values. In other words, it expresses (in the same units as the target variable) how much predictions deviate from actual outcomes on average, weighting every error in proportion to its size rather than disproportionately penalizing larger errors.

Pros

Less sensitive to outliers compared to MSE.
Easy to interpret (absolute difference in the target’s units).

Cons

Does not square errors, so large errors don’t get penalized more strongly than smaller ones.
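
The different treatment of large errors by MSE/RMSE and MAE can be illustrated with two hypothetical sets of predictions that have the same total absolute error. This is a minimal plain-Python sketch; the values are invented for illustration.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the target."""
    return math.sqrt(mse(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean absolute error, in the units of the target."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [10.0, 12.0, 14.0, 16.0]
y_spread = [11.0, 13.0, 13.0, 15.0]    # every prediction is off by 1
y_outlier = [10.0, 12.0, 14.0, 20.0]   # one prediction is off by 4

print(mae(y_true, y_spread), rmse(y_true, y_spread))    # 1.0 1.0
print(mae(y_true, y_outlier), rmse(y_true, y_outlier))  # 1.0 2.0
```

Both prediction sets have an MAE of 1.0, but concentrating the error in a single point doubles the RMSE.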

R-Squared (\(R^2\))

\[ R^2 = 1 - \frac{\sum_{i=1}^{N}\bigl(\hat{y}_i - y_i\bigr)^2}{\sum_{i=1}^{N}\bigl(y_i - \bar{y}\bigr)^2} \]

Measures the proportion of the variance in the target variable that is explained by the model's predictions. A value of 1 indicates a perfect fit and 0 indicates a model no better than always predicting the mean; the value is negative when the model performs worse than mean prediction.

Pros

Measures the proportion of variance in \(y\) explained by the model.
Intuitive interpretation in terms of variance explained.

Cons

Negative values can occur if the model is worse than a simple baseline.
Sensitive to outliers.

Assumptions

Typically used for linear or near-linear relationships, but also used as a broad measure of fit for many regressors.
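
The three regimes of \(R^2\) (perfect fit, mean-level performance, and worse than the mean) can be checked with a small sketch of the formula above. The function name `r_squared` and the toy targets are illustrative choices.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((p - t) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are identical.")
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]

print(r_squared(y_true, y_true))                # 1.0  (perfect fit)
print(r_squared(y_true, [2.5] * 4))             # 0.0  (always predicts the mean)
print(r_squared(y_true, [4.0, 3.0, 2.0, 1.0]))  # -3.0 (worse than the mean)
```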

5D.3. Anomaly Detection Metrics

Precision@k, Recall@k, or F1@k

When anomalies are rare, ranking-based metrics like Precision@k can be more relevant:

\[ \text{Precision@k} = \frac{\text{True Anomalies within top-k scored instances}}{k} \]

Measures how many of the k highest‑scoring instances (those the model flags as most anomalous) are actually true anomalies. It answers the question: “If an analyst investigates the top k alerts, what fraction will be genuine anomalies?”

Pros

Focuses on the top anomalous points, matching real-world investigation limits.
Aligns with scenarios where only a fixed number (k) of anomalies can be examined.

Cons

Requires a well-defined “k” which can be arbitrary.
Doesn’t assess overall ranking quality beyond the top-k boundary.
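
A minimal sketch of Precision@k, with Recall@k alongside it, is given below. It assumes that higher scores mean "more anomalous" and that labels are 1 for true anomalies and 0 otherwise; the labels and scores are hypothetical.

```python
def _top_k_indices(scores, k):
    """Indices of the k highest-scoring instances."""
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of instances.")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def precision_at_k(y_true, scores, k):
    """Fraction of the top-k scored instances that are true anomalies."""
    return sum(y_true[i] for i in _top_k_indices(scores, k)) / k


def recall_at_k(y_true, scores, k):
    """Fraction of all true anomalies that appear in the top k."""
    total = sum(y_true)
    if total == 0:
        raise ValueError("Recall@k is undefined without any true anomalies.")
    return sum(y_true[i] for i in _top_k_indices(scores, k)) / total


# Hypothetical data: 3 anomalies among 10 instances.
y_true = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
scores = [0.1, 0.9, 0.8, 0.2, 0.7, 0.3, 0.05, 0.4, 0.15, 0.6]

print(precision_at_k(y_true, scores, k=4))  # 0.5
print(recall_at_k(y_true, scores, k=4))     # 2 of the 3 anomalies are found
```

The third anomaly has a low score (0.15), so it lies outside the top 4 and does not affect Precision@4, which is the blind spot noted in the cons above.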

AUROC / AUPRC for Anomaly Scores

Similar to classification, you can treat anomalies as the positive class. AUROC or AUPRC can measure how well the model distinguishes anomalies from normal points over multiple thresholds.

Pros

Both metrics are threshold-independent.
Widely recognized framework for outlier detection.

Cons

In heavily skewed anomaly distributions, even PR curves can be misleading if extremely few anomalies exist.

Assumptions

Requires labeled anomaly data for evaluation, which may not always be available or fully reliable.

5D.4. Clustering Metrics

Silhouette Score

For each sample \(i\), let \(a(i)\) be the mean distance to the other points in the same cluster, and \(b(i)\) be the mean distance to the points in the nearest other cluster (the smallest such mean over all clusters that \(i\) does not belong to). Then the silhouette for sample \(i\) is:

\[ s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \]

The silhouette score is the mean of \(s(i)\) over all samples. It measures clustering quality by comparing how similar each data point is to the others in its own cluster with how similar it is to points in the nearest neighbouring cluster. Each \(s(i)\) ranges from -1 to 1: a value near 1 indicates that the point is well matched to its cluster and well separated from the others; a value near 0 indicates that it lies on or near a cluster boundary; and a negative value suggests that it may have been assigned to the wrong cluster.

Pros

Intuitive measure combining intra-cluster tightness and inter-cluster separation.
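Applicable to any clustering method, since it needs only the cluster labels and pairwise distances; note that the pairwise computation scales quadratically with the number of samples.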

Cons

May not properly capture nuanced cluster structures (e.g., varying density or shape).
Inherently sensitive to distance metrics used (usually Euclidean).
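
The definition of \(s(i)\) can be sketched in plain Python as follows. The sketch assumes Euclidean distance (via `math.dist`, Python 3.8+) and assigns 0 to single-point clusters, which is an assumed convention. The four toy points are hypothetical.

```python
import math


def silhouette_samples(points, labels):
    """Return s(i) for every sample, using Euclidean distance."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("Silhouette needs at least two clusters.")

    values = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            values.append(0.0)  # assumed convention for single-point clusters
            continue
        # a(i): mean distance to the other members of the same cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


def silhouette_score(points, labels):
    """Mean silhouette over all samples."""
    values = silhouette_samples(points, labels)
    return sum(values) / len(values)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # groups nearby points together
bad = silhouette_score(points, [0, 1, 1, 0])   # pairs each point with a distant one
print(round(good, 3), round(bad, 3))
```

The sensible labelling scores close to 1, while the deliberately wrong labelling yields a negative score.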

Davies-Bouldin Index

\[ \text{DBI} = \frac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} \frac{\sigma_i + \sigma_j}{d_{ij}} \]

Where \( \sigma_i \) is the average distance of points in cluster \( i \) to the cluster center, and \( d_{ij} \) is the distance between cluster centers \( i \) and \( j \).

Measures the average "similarity" between each cluster and its most similar (i.e., least-separated) neighbour. A DBI value of 0 represents perfectly distinct, non-overlapping clusters, and higher DBI values indicate more overlap or dispersion.

Pros

Lower DBI indicates more compact, better-separated clusters.
Generally easier to interpret than some older metrics such as the Dunn index.

Cons

Can be sensitive to outliers or non-spherical clusters.
Choice of distance metric influences results significantly.
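
The DBI formula can be sketched as below, assuming Euclidean distance and cluster centers taken as centroids. The helper names and the toy points are illustrative. The sketch does not handle clusters whose centers coincide, which would lead to a division by zero.

```python
import math


def _centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def davies_bouldin(points, labels):
    """Davies-Bouldin index with Euclidean distance and centroid centers."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    keys = list(clusters)
    if len(keys) < 2:
        raise ValueError("DBI needs at least two clusters.")

    centers = {k: _centroid(clusters[k]) for k in keys}
    # sigma_i: average distance of the points in cluster i to its center.
    sigma = {
        k: sum(math.dist(p, centers[k]) for p in clusters[k]) / len(clusters[k])
        for k in keys
    }
    total = 0.0
    for i in keys:
        # Worst-case (largest) similarity ratio between cluster i and any other.
        total += max(
            (sigma[i] + sigma[j]) / math.dist(centers[i], centers[j])
            for j in keys
            if j != i
        )
    return total / len(keys)


labels = [0, 0, 1, 1]
tight = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
loose = [(0.0, 0.0), (0.0, 8.0), (10.0, 0.0), (10.0, 8.0)]
print(davies_bouldin(tight, labels), davies_bouldin(loose, labels))
```

The centers are the same distance apart in both cases, but the more dispersed clusters produce the higher (worse) DBI.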

Calinski–Harabasz Index

Let
\(N\) = total number of samples
\(K\) = number of clusters
\(n_k\) = size of cluster \(k\)
\(\mathbf{c}_k\) = centroid of cluster \(k\)
\(\bar{\mathbf{x}}\) = global centroid of all data
\(C_k\) = set of samples in cluster \(k\)

\[ \text{CH} = \frac{\text{Between‑cluster dispersion} \; /\; (K-1)} {\text{Within‑cluster dispersion} \;/\; (N-K)} = \frac{ \displaystyle \sum_{k=1}^{K} n_k \bigl\lVert \mathbf{c}_k-\bar{\mathbf{x}}\bigr\rVert^2 } {K-1} \Bigg/ \frac{ \displaystyle \sum_{k=1}^{K} \sum_{\mathbf{x}\in C_k} \bigl\lVert \mathbf{x}-\mathbf{c}_k\bigr\rVert^2 } {N-K} \]

Measures clustering quality by comparing how well clusters are separated from each other to how compact they are internally. A higher CH value indicates denser clusters with greater inter‑cluster separation—hence, “better” clustering. This metric is often used to compare different clustering solutions on the same dataset or to help choose \(K\) when combined with other metrics (e.g., Silhouette, Davies–Bouldin) and domain knowledge.

Pros

Scale‑invariant: Unaffected by uniform scaling of the feature space.
Simple interpretation: Larger values mean compact, well‑separated clusters.
No ground‑truth labels needed.
Efficient to compute for most clustering results; requires only centroids and intra‑cluster distances.

Cons

Favors spherical, equal‑variance clusters; may undervalue clusters with different shapes or densities.
Monotonic with \(K\) for some datasets—can increase continuously as you add clusters, making “optimal K” ambiguous.
Sensitive to outliers; extreme points inflate within‑cluster dispersion, lowering CH.
Requires Euclidean‑like distance; non‑Euclidean metrics can violate assumptions behind the variance ratio.

Assumptions

Data in a continuous vector space.
Clusters are roughly convex or hyperspherical; elongated or density‑based structures may be mis‑scored.
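
The CH formula can be sketched in plain Python as follows, using squared Euclidean distances as in the definition above. The helper names and toy points are illustrative. The sketch does not guard against zero within-cluster dispersion.

```python
def _centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def _sq_dist(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def calinski_harabasz(points, labels):
    """Ratio of between-cluster to within-cluster dispersion."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    n, k = len(points), len(clusters)
    if k < 2 or k >= n:
        raise ValueError("CH requires 2 <= K < N.")

    global_center = _centroid(points)
    between = sum(
        len(members) * _sq_dist(_centroid(members), global_center)
        for members in clusters.values()
    )
    within = sum(
        _sq_dist(p, _centroid(members))
        for members in clusters.values()
        for p in members
    )
    return (between / (k - 1)) / (within / (n - k))


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]
ch = calinski_harabasz(points, labels)

# Uniformly scaling every coordinate leaves the ratio unchanged.
scaled = [(3 * x, 3 * y) for x, y in points]
ch_scaled = calinski_harabasz(scaled, labels)
print(ch, ch_scaled)
```

Here the between-cluster dispersion is 100 and the within-cluster dispersion is 4, so CH = (100 / 1) / (4 / 2) = 50, and the scaled copy gives the same value.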

5D.5. Baseline Heuristic Approaches

Regardless of the problem type, always establish a baseline based on these heuristics:

Classification: Predict the majority class or random guess.
Regression: Predict the mean or median of the training data.
Anomaly Detection: Label everything as normal or set an arbitrary threshold.
Clustering: Compare against random partitioning or a single-cluster solution.

These baselines help you gauge if your model’s performance exceeds naive or trivial strategies.
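
As a minimal sketch, the classification and regression baselines listed above might look like this in plain Python. The function names and the toy data are hypothetical.

```python
from collections import Counter


def majority_class_baseline(y_train, n_predictions):
    """Predict the most frequent training label for every instance."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    return [majority_label] * n_predictions


def mean_baseline(y_train, n_predictions):
    """Predict the training mean for every instance."""
    mean_value = sum(y_train) / len(y_train)
    return [mean_value] * n_predictions


# Classification: hypothetical labels with an 80/20 split.
y_train_cls = ["no"] * 8 + ["yes"] * 2
y_test_cls = ["no", "no", "yes", "no"]
cls_pred = majority_class_baseline(y_train_cls, len(y_test_cls))
baseline_accuracy = sum(t == p for t, p in zip(y_test_cls, cls_pred)) / len(y_test_cls)

# Regression: hypothetical continuous targets.
y_train_reg = [100.0, 200.0, 300.0]
y_test_reg = [150.0, 250.0]
reg_pred = mean_baseline(y_train_reg, len(y_test_reg))
baseline_mae = sum(abs(t - p) for t, p in zip(y_test_reg, reg_pred)) / len(y_test_reg)

print(baseline_accuracy, baseline_mae)  # 0.75 50.0
```

A trained model that cannot beat these numbers on the same test set is adding little beyond the naive strategy.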

Fairness and Bias Metrics

In many real-world applications, especially in government, evaluating fairness and bias in your models is paramount. Commonly used metrics include:

Demographic Parity (difference in positive prediction rates across different demographic groups).
Equal Opportunity (difference in true positive rates across different demographic groups).
Equalized Odds (difference in false positive and false negative rates across different demographic groups).

Integrating these metrics in your model evaluation process ensures your model’s decisions are equitable across various demographic or protected groups in your population.
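
The first two metrics can be sketched as follows. The sketch assumes binary labels and predictions (0/1) and reports each metric as the gap between the highest and lowest group rate, which is one common convention among several. The group names and data are hypothetical.

```python
def _rate_by_group(values, groups):
    """Mean of 0/1 values within each group."""
    totals, hits = {}, {}
    for value, group in zip(values, groups):
        totals[group] = totals.get(group, 0) + 1
        hits[group] = hits.get(group, 0) + value
    return {g: hits[g] / totals[g] for g in totals}


def demographic_parity_difference(y_pred, groups):
    """Gap in positive prediction rates between groups."""
    rates = _rate_by_group(y_pred, groups)
    return max(rates.values()) - min(rates.values())


def equal_opportunity_difference(y_true, y_pred, groups):
    """Gap in true positive rates between groups.

    Groups without any actual positives are left out of the comparison.
    """
    positives = [(p, g) for t, p, g in zip(y_true, y_pred, groups) if t == 1]
    rates = _rate_by_group([p for p, _ in positives], [g for _, g in positives])
    return max(rates.values()) - min(rates.values())


# Hypothetical data for two demographic groups, "A" and "B".
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

print(demographic_parity_difference(y_pred, groups))         # 0.5
print(equal_opportunity_difference(y_true, y_pred, groups))  # 0.5
```

Group A receives positive predictions at a rate of 0.75 against 0.25 for group B, and its true positive rate is 1.0 against 0.5, so both gaps are 0.5. A value of 0 would indicate parity on that metric.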

5D.6. Incorporating an Evaluation Step in AWS SageMaker Pipeline

This section explains how the evaluation step is defined in an AWS SageMaker pipeline so that every time a new tuned model is produced, it is evaluated on the held‑out test set and its metrics are written to S3.

1. Create a ScriptProcessor

# pipeline.py
from sagemaker.processing import ScriptProcessor

script_eval = ScriptProcessor(
    image_uri=image_uri,
    command=["python3"],
    instance_type=processing_instance_type,
    instance_count=PROCESSING_INSTANCE_COUNT,
    base_job_name=f"{base_job_prefix}/evaluate",
    sagemaker_session=pipeline_session,
    role=role,
)

ScriptProcessor spins up a transient processing container (here, a Python 3 image with XGBoost, pandas, etc.) and executes evaluate.py inside that container.

2. Define the input and expected output of the evaluation step

# pipeline.py
from sagemaker.processing import ProcessingInput, ProcessingOutput

evaluation_inputs = [
    ProcessingInput(
        source=step_tuning.get_top_model_s3_uri(
            top_k=0, s3_bucket=default_bucket, prefix=model_prefix
        ),
        destination="/opt/ml/processing/model",
    ),
    ProcessingInput(
        source=step_process.properties.ProcessingOutputConfig.Outputs[
            "test"
        ].S3Output.S3Uri,
        destination="/opt/ml/processing/test",
    ),
]

evaluation_outputs = [
    ProcessingOutput(
        output_name="evaluation", source="/opt/ml/processing/evaluation"
    ),
]

Model artifact (best trial) and test CSV are mounted inside the container.
Metrics will be saved to /opt/ml/processing/evaluation/evaluation.json and automatically uploaded to S3.

3. Attach a PropertyFile for downstream reuse

# pipeline.py
from sagemaker.workflow.properties import PropertyFile

evaluation_report = PropertyFile(
    name="EvaluationReport",
    output_name="evaluation",
    path="evaluation.json",
)

PropertyFile lets later pipeline steps (e.g., model registration or conditional approval) pull metrics such as MSE or R² from the JSON.

4. Define the step arguments and wrap in a ProcessingStep

# pipeline.py
import os

from sagemaker.workflow.steps import ProcessingStep

eval_args = script_eval.run(
    inputs=evaluation_inputs,
    outputs=evaluation_outputs,
    code=os.path.join(BASE_DIR, "evaluate.py"),
)

step_eval = ProcessingStep(
    name="EvaluateModel",
    step_args=eval_args,
    property_files=[evaluation_report],
    cache_config=cache_config,
)

ProcessingStep inserts this evaluation node into the pipeline DAG; its outputs can be cached so identical evaluations aren’t re‑run.

5. Define model evaluation inside evaluate.py

# evaluate.py
import json
import logging
import pathlib
import pickle
import tarfile
import math
import numpy as np
import pandas as pd
import xgboost
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

logger = logging.getLogger()
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler())


def mean_absolute_percentage_error(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Calculates the mean absolute percentage error between the true and predicted resale prices.

    Note: the result is not meaningful (inf or nan) if any true value is zero.

    Args:
        y_true (np.ndarray): Contains the true resale prices.
        y_pred (np.ndarray): Contains the predicted resale prices.

    Returns:
        float: The mean absolute percentage error, expressed in percent.
    """
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)


if __name__ == "__main__":
    #########################
    # Download Model from S3
    #########################
    logger.debug("Starting evaluation.")
    model_path = "/opt/ml/processing/model/model.tar.gz"
    with tarfile.open(model_path) as tar:
        tar.extractall(path=".")

    #########################
    # Load Model
    #########################
    logger.debug("Loading xgboost model.")
    # Use a context manager so the file handle is closed after loading.
    with open("xgboost-model", "rb") as model_file:
        model = pickle.load(model_file)

    #########################
    # Download Dataset from S3
    #########################
    logger.debug("Reading test data.")
    test_path = "/opt/ml/processing/test/test.csv"
    df = pd.read_csv(test_path, header=None)

    #########################
    # Load Dataset
    #########################
    logger.debug("Splitting test data into labels and features.")
    y_test = df.iloc[:, 0].to_numpy()
    df.drop(df.columns[0], axis=1, inplace=True)
    X_test = xgboost.DMatrix(df.values)

    #########################
    # Predict on Dataset
    #########################
    logger.info("Performing predictions against test data.")
    predictions = model.predict(X_test)

    #########################
    # Generate Metrics
    #########################
    logger.debug("Calculating regression metrics.")
    mse = mean_squared_error(y_test, predictions)
    mae = mean_absolute_error(y_test, predictions)
    mape = mean_absolute_percentage_error(y_test, predictions)
    r2 = r2_score(y_test, predictions)
    std = np.std(y_test - predictions)
    report_dict = {
        "regression_metrics": {
            "mse": {"value": mse, "standard_deviation": std},
            "rmse": {"value": math.sqrt(mse)},
            "mae": {"value": mae},
            "mape": {"value": mape},
            "r2": {"value": r2},
        },
    }

    #########################
    # Output Metrics to File in Path
    #########################
    # Files written to /opt/ml/processing/evaluation are uploaded to S3.
    # The path is based on evaluation_outputs in pipeline.py.
    output_dir = "/opt/ml/processing/evaluation"
    pathlib.Path(output_dir).mkdir(parents=True, exist_ok=True)

    logger.info("Writing out evaluation report with mae: %f", mae)
    evaluation_path = f"{output_dir}/evaluation.json"
    with open(evaluation_path, "w") as f:
        f.write(json.dumps(report_dict))

Extracts model.tar.gz from /opt/ml/processing/model and loads the XGBoost model.
Loads the test data from /opt/ml/processing/test/test.csv and splits it into y_test (the first column) and the feature matrix.
Generates predictions by calling model.predict.
Computes the metrics: MSE, RMSE, MAE, MAPE, R², and the standard deviation of the errors.
Writes the evaluation report as JSON to /opt/ml/processing/evaluation/evaluation.json; SageMaker then copies it to the S3 location defined by the ProcessingOutput.

The resulting JSON file will look like this:

{
  "regression_metrics": {
    "mse":  { "value": 11234.5, "standard_deviation": 98.2 },
    "rmse": { "value": 105.9 },
    "mae":  { "value": 78.4 },
    "mape": { "value": 4.8 },
    "r2":   { "value": 0.92 }
  }
}
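
A consumer of this report, such as a manual check or a script that decides whether to promote the model, might read it as sketched below. The function name `passes_quality_gate` and the thresholds are hypothetical; only the JSON layout comes from the report above.

```python
import json


def passes_quality_gate(report_json, max_rmse, min_r2):
    """Return True if the report meets both (hypothetical) thresholds."""
    metrics = json.loads(report_json)["regression_metrics"]
    rmse = metrics["rmse"]["value"]
    r2 = metrics["r2"]["value"]
    return rmse <= max_rmse and r2 >= min_r2


# Same structure and sample values as the example report above.
report_json = """
{
  "regression_metrics": {
    "mse":  { "value": 11234.5, "standard_deviation": 98.2 },
    "rmse": { "value": 105.9 },
    "mae":  { "value": 78.4 },
    "mape": { "value": 4.8 },
    "r2":   { "value": 0.92 }
  }
}
"""

print(passes_quality_gate(report_json, max_rmse=150.0, min_r2=0.9))  # True
print(passes_quality_gate(report_json, max_rmse=100.0, min_r2=0.9))  # False
```

Appropriate thresholds depend on the target's scale and on the baseline results, so the values shown here are placeholders only.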

5D.7. Summary

Model evaluation is multi-faceted and should be tailored to the specific problem type:

Classification: Accuracy, Precision, Recall, F1, ROC AUC, PR AUC. Suited to scenarios with binary or multi-class targets; account for potential class imbalance in the dataset.

Regression: MSE, RMSE, MAE, \(R^2\). Focuses on how closely predictions match continuous targets.

Anomaly Detection: Adapt classification metrics (Precision@k, Recall@k, AUROC, AUPRC), or specialized ranking-based metrics. Evaluate how effectively the model flags outliers vs. inliers.

Clustering: Silhouette Score, Davies-Bouldin, or Calinski-Harabasz. Assess how well data points form cohesive yet separate clusters.

Always consider baseline heuristics (e.g., predicting the mean, majority class, or random assignment) to verify that your model outperforms naive strategies. Moreover, in sensitive applications, incorporate fairness and bias metrics to ensure ethical and equitable outcomes. By adapting your AWS SageMaker pipeline to run custom evaluation scripts, you can automate the process of computing and storing relevant performance metrics at scale.