Chapter 5F: Build & Deploy

In this chapter, we’ll walk through setting up a CI/CD pipeline on SHIP-HATS GitLab to manage your deployment workflow (from model build to endpoint creation), along with steps to troubleshoot build failures. Although testing (unit, integration, system) is not the primary focus here, the chapter also briefly describes best practices for each testing strategy. Keep in mind that the specifics of your workflow will depend on your setup and your preferences for handling development and deployment; there is no one-size-fits-all approach.

By the end of this chapter, you’ll be able to configure a SHIP-HATS GitLab pipeline that promotes models through different environments based on branch merges—ensuring a consistent, traceable, and auditable deployment process.

5F.1. Overview of the Development to Deployment Flow

Feature Branch Creation
A new feature branch is created from the main branch. Examples of development work performed on this branch include model updating, model re-training, and feature building. Once preliminary development work is completed, this branch is merged into the dev branch.
Commit to dev Branch
The dev branch focuses on model building. A code commit on this branch triggers an automated build pipeline that spins up a runner, checks out the repository, and runs the build commands specified in your .gitlab-ci.yml, which include model training, validation checks, and artifact generation. After manual approval, the model is deployed to the dev environment, and the API endpoint is generated and validated.
Merge dev Branch into uat Branch
Merging into uat initiates a testing-only pipeline (no rebuilding). This includes model validation, A/B testing, and performance checks. The existing model artifacts are deployed to the uat environment, and API endpoints are generated and validated.
Merge uat Branch into main Branch
The main branch handles deployment only (no rebuilding). Merging into main triggers the deployment pipeline for the production release. After final approval, the validated model is deployed to production. Final endpoint validation confirms a successful deployment.
API Endpoint Generation and Testing
Each deployment step includes an endpoint generation script and subsequent testing to verify connectivity and responses, without rebuilding the model artifacts.
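
The branch-to-job flow described above can be sketched as a small lookup table. This is only an illustration, assuming the job names and branch rules of the example .gitlab-ci.yml shown later in this chapter; the helper `jobs_for_branch` is hypothetical and is not part of GitLab.

```python
from typing import List

# Illustrative only: mirrors the `rules:` blocks of the example .gitlab-ci.yml.
# `jobs_for_branch` is a hypothetical helper, not a GitLab API.
PIPELINE_JOBS = {
    # job name -> branches on which its rules match (listed in stage order)
    "deploy_pipeline": {"dev"},                 # model build runs on dev only
    "check_model": {"dev", "uat", "main"},
    "approve_model": {"dev", "uat", "main"},    # manual gate
    "deploy_model": {"dev", "uat", "main"},
    "validate_endpoint": {"dev", "uat", "main"},
}


def jobs_for_branch(branch: str) -> List[str]:
    """Return the jobs that would be created for a commit on `branch`."""
    return [job for job, branches in PIPELINE_JOBS.items() if branch in branches]


if __name__ == "__main__":
    for branch in ("dev", "uat", "main", "feature/new-model"):
        print(branch, "->", jobs_for_branch(branch) or "no pipeline jobs")
```

Only dev includes the build job; uat and main reuse the artifacts that dev produced, and feature branches do not match any rule.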

5F.2. Prerequisites

GitLab Repository Setup
You have a SHIP-HATS GitLab repository with the following branches: dev, uat, and main.
GitLab Runners
A GitLab runner (shared or self-hosted) must be configured to execute your CI/CD jobs.
Credentials and Environment Variables
Store environment-specific credentials (e.g., AWS keys, Docker credentials) as secure CI/CD Variables in GitLab.
Dev, UAT, and Production Environments
Ensure each environment (dev, uat, production) can host or serve your model—for example, on AWS SageMaker.

5F.3. Setting Up a CI/CD Pipeline on SHIP-HATS GitLab

1. Configure your `.gitlab-ci.yml` File

Below is an example of how your .gitlab-ci.yml file could look. It demonstrates multiple stages, including model building, testing, deployment pipeline setup, model checks, approvals, and endpoint validation for the dev, uat, and production environments. Each stage can be adapted to your specific workflow, such as creating or updating an AWS SageMaker endpoint.

variables:
  MODEL_PACKAGE_GROUP_NAME: "ResalePriceProjectPackageGroup"
  INSTANCE_TYPE: "ml.m5.large"
  INSTANCE_COUNT: "1"
  MAX_WAIT_TIME: "900"
  POLL_INTERVAL: "30"

stages:
  - deploy_pipeline
  - check_model
  - approve_model
  - deploy_model
  - validate_endpoint

.default-setup: &default-setup
  image: nexus-docker.ship.gov.sg/python:3.8-slim
  tags: [ship_docker]
  id_tokens:
    AWS_OIDC_TOKEN:
      aud: https://sgts.gitlab-dedicated.com
  before_script:
    # Install the CLI tools and Python SDKs used by the jobs below.
    - apt-get update && apt-get install -y curl unzip jq
    - pip install --upgrade pip
    - pip install boto3 sagemaker awscli
    # Configure the AWS CLI to assume AWS_ROLE_ARN using the GitLab-issued OIDC token.
    - mkdir -p ~/.aws
    - echo "${AWS_OIDC_TOKEN}" > /tmp/web_identity_token.txt
    - echo -e "[profile oidc]\nrole_arn=${AWS_ROLE_ARN}\nweb_identity_token_file=/tmp/web_identity_token.txt" > ~/.aws/config
    - export AWS_PROFILE=oidc
    - export AWS_DEFAULT_REGION=$AWS_REGION
    # Fail fast if the role cannot be assumed.
    - aws sts get-caller-identity

.deploy_model_template: &deploy_model_template
  <<: *default-setup
  script:
    - source model_metadata.env
    - |
      if [ "$MODEL_FOUND" == "false" ]; then
        echo "No approved model. Skipping deployment."
        exit 0
      fi

      MODEL_PACKAGE_ARN=$(aws sagemaker list-model-packages \
        --model-package-group-name "$MODEL_PACKAGE_GROUP_NAME" \
        --query "reverse(sort_by(ModelPackageSummaryList[?ModelApprovalStatus=='Approved'], &CreationTime))[0].ModelPackageArn" \
        --output text 2>/dev/null || echo "")

      if [ -z "$MODEL_PACKAGE_ARN" ] || [ "$MODEL_PACKAGE_ARN" == "None" ]; then
        echo "No approved model found."
        exit 1
      fi

      MODEL_NAME="resale-price-model-${CI_ENVIRONMENT_NAME}-${CI_COMMIT_SHORT_SHA}"
      CONFIG_NAME="resale-price-config-${CI_ENVIRONMENT_NAME}-${CI_COMMIT_SHORT_SHA}"
      ENDPOINT_NAME="resale-price-endpoint-${CI_ENVIRONMENT_NAME}"

      aws sagemaker create-model \
        --model-name "$MODEL_NAME" \
        --containers "[{\"ModelPackageName\": \"$MODEL_PACKAGE_ARN\"}]" \
        --execution-role-arn "$AWS_ROLE_ARN"

      aws sagemaker create-endpoint-config \
        --endpoint-config-name "$CONFIG_NAME" \
        --production-variants "[{\"VariantName\": \"AllTraffic\", \"ModelName\": \"$MODEL_NAME\", \"InstanceType\": \"$INSTANCE_TYPE\", \"InitialInstanceCount\": $INSTANCE_COUNT}]"

      ENDPOINT_STATUS=$(aws sagemaker describe-endpoint --endpoint-name "$ENDPOINT_NAME" --query "EndpointStatus" --output text 2>/dev/null || echo "NOT_FOUND")

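      if [ "$ENDPOINT_STATUS" == "NOT_FOUND" ]; then
        aws sagemaker create-endpoint --endpoint-name "$ENDPOINT_NAME" --endpoint-config-name "$CONFIG_NAME"
      elif [ "$ENDPOINT_STATUS" == "InService" ]; then
        aws sagemaker update-endpoint --endpoint-name "$ENDPOINT_NAME" --endpoint-config-name "$CONFIG_NAME"
      else
        # An endpoint that is still Creating or Updating (or has Failed) cannot be updated.
        echo "Endpoint $ENDPOINT_NAME is in status $ENDPOINT_STATUS; cannot create or update it now." >&2
        exit 1
      fi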

.validate_endpoint_template: &validate_endpoint_template
  <<: *default-setup
  script:
    - source model_metadata.env
    - |
      ENDPOINT_NAME="resale-price-endpoint-${CI_ENVIRONMENT_NAME}"
      TIME_ELAPSED=0
      while true; do
        STATUS=$(aws sagemaker describe-endpoint --endpoint-name "$ENDPOINT_NAME" --query "EndpointStatus" --output text 2>/dev/null || echo "NOT_FOUND")
        if [ "$STATUS" == "InService" ]; then
          echo "Endpoint is now in service!"
          break
        fi
        if [ "$STATUS" == "Failed" ] || [ "$TIME_ELAPSED" -ge "$MAX_WAIT_TIME" ]; then
          echo "Endpoint failed or timed out."
          exit 1
        fi
        echo "Waiting... Status: $STATUS"
        sleep $POLL_INTERVAL
        TIME_ELAPSED=$((TIME_ELAPSED + POLL_INTERVAL))
      done

deploy_pipeline:
  <<: *default-setup
  stage: deploy_pipeline
  script:
    - python -m pipelines.get_pipeline_definition -n model_code.pipeline -f pipeline.json --region $AWS_REGION
    - python -m pipelines.run_pipeline -n model_code.pipeline -role-arn "$AWS_ROLE_ARN"
  environment: $CI_COMMIT_BRANCH
  rules:
    - if: '$CI_COMMIT_BRANCH == "dev"'

check_model:
  <<: *default-setup
  stage: check_model
  script:
    - |
      if [ "$CI_COMMIT_BRANCH" == "dev" ]; then
        MODEL_PACKAGE_ARN=$(aws sagemaker list-model-packages \
          --model-package-group-name "$MODEL_PACKAGE_GROUP_NAME" \
          --query "reverse(sort_by(ModelPackageSummaryList[?ModelApprovalStatus=='PendingManualApproval'], &CreationTime))[0].ModelPackageArn" \
          --output text 2>/dev/null || echo "")
      else
        MODEL_PACKAGE_ARN=$(aws sagemaker list-model-packages \
          --model-package-group-name "$MODEL_PACKAGE_GROUP_NAME" \
          --query "reverse(sort_by(ModelPackageSummaryList[?ModelApprovalStatus=='Approved'], &CreationTime))[0].ModelPackageArn" \
          --output text 2>/dev/null || echo "")
      fi

      if [ -z "$MODEL_PACKAGE_ARN" ] || [ "$MODEL_PACKAGE_ARN" == "None" ]; then
        echo "MODEL_FOUND=false" > model_metadata.env
      else
        echo "MODEL_FOUND=true" > model_metadata.env
        echo "MODEL_PACKAGE_ARN=$MODEL_PACKAGE_ARN" >> model_metadata.env
      fi
  artifacts:
    paths: [model_metadata.env]
  environment: $CI_COMMIT_BRANCH
  rules:
    - if: '$CI_COMMIT_BRANCH == "dev"'
    - if: '$CI_COMMIT_BRANCH == "uat"'
    - if: '$CI_COMMIT_BRANCH == "main"'

approve_model:
  <<: *default-setup
  stage: approve_model
  script:
    - source model_metadata.env
    - |
      if [ "$CI_COMMIT_BRANCH" == "dev" ]; then
        if [ "$MODEL_FOUND" == "true" ]; then
          aws sagemaker update-model-package \
            --model-package-arn "$MODEL_PACKAGE_ARN" \
            --model-approval-status "Approved"
        else
          echo "No pending model found for manual approval in $CI_COMMIT_BRANCH." >&2
          exit 1
        fi
      else
        if [ "$MODEL_FOUND" == "true" ]; then
          echo "Using approved model for $CI_COMMIT_BRANCH: $MODEL_PACKAGE_ARN"
        else
          echo "No approved model found for deployment in $CI_COMMIT_BRANCH." >&2
          exit 1
        fi
      fi
  artifacts:
    paths: [model_metadata.env]
  when: manual
  allow_failure: false
  environment: $CI_COMMIT_BRANCH
  rules:
    - if: '$CI_COMMIT_BRANCH == "dev"'
    - if: '$CI_COMMIT_BRANCH == "uat"'
    - if: '$CI_COMMIT_BRANCH == "main"'

deploy_model:
  <<: *deploy_model_template
  stage: deploy_model
  dependencies: [approve_model]
  artifacts:
    paths: [model_metadata.env]
  environment: $CI_COMMIT_BRANCH
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'
    - if: '$CI_COMMIT_BRANCH == "uat"'
    - if: '$CI_COMMIT_BRANCH == "dev"'
      when: on_success

validate_endpoint:
  <<: *validate_endpoint_template
  stage: validate_endpoint
  dependencies: [deploy_model]
  environment: $CI_COMMIT_BRANCH
  rules:
    - if: '$CI_COMMIT_BRANCH == "dev"'
    - if: '$CI_COMMIT_BRANCH == "uat"'
    - if: '$CI_COMMIT_BRANCH == "main"'
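
The polling loop in validate_endpoint_template can also be expressed in Python, which makes it easier to test without touching AWS. The following is a minimal sketch under stated assumptions: `wait_for_endpoint` and its injected `get_status` callable are hypothetical names, and the defaults simply copy MAX_WAIT_TIME and POLL_INTERVAL from the configuration above.

```python
import time
from typing import Callable


def wait_for_endpoint(
    get_status: Callable[[], str],
    max_wait_time: int = 900,   # seconds, same default as MAX_WAIT_TIME
    poll_interval: int = 30,    # seconds, same default as POLL_INTERVAL
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the endpoint is InService; return False on failure or timeout."""
    elapsed = 0
    while True:
        status = get_status()
        if status == "InService":
            return True
        if status == "Failed" or elapsed >= max_wait_time:
            return False
        sleep(poll_interval)
        elapsed += poll_interval


if __name__ == "__main__":
    # Stubbed status sequence for illustration; no AWS call is made here.
    statuses = iter(["Creating", "Creating", "InService"])
    ok = wait_for_endpoint(lambda: next(statuses), sleep=lambda _seconds: None)
    print("in service" if ok else "failed or timed out")
```

In a real job, `get_status` could wrap boto3's `describe_endpoint(EndpointName=...)["EndpointStatus"]`; that wiring is left out so the sketch stays runnable offline.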

Explanation of Key Sections

Variables Section: Sets reusable defaults for model package group name, instance type/count, and polling timeouts.
Stages Section: Defines the CI/CD workflow phases: pipeline deployment, model check, manual approval, model deployment, and endpoint validation.
default-setup: Common job settings including Docker image, AWS OIDC authentication, and installation of AWS CLI, Boto3, and SageMaker SDK.
deploy_model_template: Shared script logic to skip deployment if no approved model exists, then create or update the SageMaker model, endpoint configuration, and endpoint.
validate_endpoint_template: Shared polling loop that waits for the SageMaker endpoint to become “InService” or fail after a timeout.
deploy_pipeline: Executes Python commands to register or update the SageMaker pipeline definition; scoped to the dev branch.
check_model: Queries SageMaker for the latest pending (dev) or approved (uat/main) model package and writes MODEL_FOUND and MODEL_PACKAGE_ARN to model_metadata.env, which is passed to downstream jobs as an artifact.
approve_model: Manual gate that either approves a pending model on dev or confirms approval on uat/main; writes metadata for later steps.
deploy_model: Deploys the approved model using the deployment template, conditional on successful approval.
validate_endpoint: Runs the validation template to ensure the deployed endpoint is active before finishing.
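
The model selection performed by check_model and deploy_model_template (filter by approval status, sort by creation time, take the newest) can be sketched in plain Python. This is an illustration only: the function names are hypothetical, and the dictionaries are assumed to carry the same ModelPackageArn, ModelApprovalStatus, and CreationTime fields that the `--query` expressions above read from ModelPackageSummaryList.

```python
from datetime import datetime
from typing import Dict, List, Optional


def latest_package_arn(summaries: List[Dict], status: str) -> Optional[str]:
    """Return the ARN of the newest package with the given approval status."""
    matching = [s for s in summaries if s.get("ModelApprovalStatus") == status]
    if not matching:
        return None
    newest = max(matching, key=lambda s: s["CreationTime"])
    return newest["ModelPackageArn"]


def render_metadata(arn: Optional[str]) -> str:
    """Build the contents of model_metadata.env in the shape check_model writes."""
    if arn is None:
        return "MODEL_FOUND=false\n"
    return f"MODEL_FOUND=true\nMODEL_PACKAGE_ARN={arn}\n"


if __name__ == "__main__":
    # Made-up placeholder records, not real ARNs.
    packages = [
        {"ModelPackageArn": "arn:example:package/1",
         "ModelApprovalStatus": "Approved",
         "CreationTime": datetime(2024, 1, 1)},
        {"ModelPackageArn": "arn:example:package/2",
         "ModelApprovalStatus": "PendingManualApproval",
         "CreationTime": datetime(2024, 2, 1)},
    ]
    print(render_metadata(latest_package_arn(packages, "Approved")), end="")
```

On dev the status passed in would be "PendingManualApproval"; on uat and main it would be "Approved", matching the two branches of the check_model script.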

2. Verify Pipeline Settings

In your SHIP-HATS GitLab project, go to Settings > CI/CD to confirm:
- Shared Runners or Specific Runners are available.
- CI/CD Variables are securely stored. Variables that you may want to include are: AWS_ACCOUNT_ID, AWS_REGION, AWS_ROLE_ARN, and S3_BUCKET_NAME.
- (Optional) Protect your dev branch if you want to restrict who can directly push or merge code into it.
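
A job can fail early with a clear message when a variable is missing, rather than failing later inside an AWS call. The sketch below assumes the four variable names suggested above; `missing_variables` is a hypothetical helper, not something GitLab provides.

```python
import os
from typing import List, Mapping, Sequence

# Variable names taken from the suggestion above; adjust to your project.
REQUIRED_VARIABLES = ("AWS_ACCOUNT_ID", "AWS_REGION", "AWS_ROLE_ARN", "S3_BUCKET_NAME")


def missing_variables(
    env: Mapping[str, str],
    required: Sequence[str] = REQUIRED_VARIABLES,
) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_variables(os.environ)
    if missing:
        print("Missing CI/CD variables: " + ", ".join(missing))
    else:
        print("All required CI/CD variables are set.")
```

In a pipeline, the script could exit with a non-zero status when the list is non-empty so that the job stops before any deployment step runs.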

3. Perform First Commit to the `dev` Branch

With your .gitlab-ci.yml file and, optionally, a local environment in place, you can verify that the CI/CD pipeline is set up correctly as follows:

Make a Small Change (e.g., editing the README.md file)

Stage and Commit

git add README.md
git commit -m "test: initial commit for verifying pipeline"
git push origin dev

Check GitLab Pipeline
- In your SHIP-HATS GitLab project, go to Build > Pipelines.
- Observe the pipeline for the dev branch. If configured correctly, you’ll see a deploy_pipeline job triggered.
- The pipeline pauses at the manual approve_model job (shown as blocked) until an authorized user runs it. Once every job has succeeded, a green check mark appears next to the pipeline run in the Build > Pipelines section.

Tip: You can also configure GitLab Notifications (Slack, email) to get immediate alerts when a build fails or succeeds.

5F.4. Debugging Failed Builds

If your build fails, here are several places you should go to find the error messages needed for debugging:

GitLab Pipeline Logs
- Go to the Build > Pipelines section in your SHIP-HATS GitLab project.
- Locate the failed pipeline run and click Jobs to view detailed logs.
- Look for error messages or tracebacks in the console output.
SageMaker AI Studio
- If you are using AWS SageMaker Pipelines integrated with your GitLab project, head to SageMaker AI Studio.
- Navigate to the Pipelines tab to find the corresponding run.
- Detailed error messages (e.g., from Docker builds or training jobs) may appear here.
CloudWatch Logs
- If your build is running tasks on AWS infrastructure, relevant logs may be available in Amazon CloudWatch.
- Check for error messages related to insufficient permissions, resource limits, or misconfiguration.
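
When a job log is long, a small filter can surface the lines worth reading first. This is a rough sketch, assuming you have saved the log text to a string; the keyword list is an assumption to tune for your own tooling, and `find_error_lines` is a hypothetical helper.

```python
import re
from typing import List, Tuple

# Keywords are illustrative assumptions; extend them for your own tooling.
ERROR_PATTERN = re.compile(
    r"error|exception|traceback|accessdenied|failed", re.IGNORECASE
)


def find_error_lines(log_text: str) -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs for lines that look like errors."""
    return [
        (number, line.strip())
        for number, line in enumerate(log_text.splitlines(), start=1)
        if ERROR_PATTERN.search(line)
    ]


if __name__ == "__main__":
    # Made-up sample text, not real output from any tool.
    sample = "Installing dependencies\nERROR: build step failed\nDone\n"
    for number, line in find_error_lines(sample):
        print(f"line {number}: {line}")
```

A keyword filter will produce false positives (for example, a log line that merely mentions "0 errors"), so treat its output as a starting point rather than a diagnosis.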

5F.5. Brief Overview of Testing Types

While testing is not the main focus of this guide, it’s important to note that different tests can be integrated at multiple stages of your CI/CD pipeline:

Unit Tests
- Focus on small, isolated chunks of code (e.g., individual functions or classes).
- Quick to run, easy to maintain.
Integration Tests
- Validate that different components (e.g., data processing, model training) work together correctly.
- Involve real or mock dependencies.
System Tests
- Test the entire pipeline end-to-end, including external systems like databases, message queues, or third-party APIs.
- More complex, potentially time-consuming.

Depending on your project’s complexity, you can incorporate these tests in separate .gitlab-ci.yml jobs. For instance, after a build stage, you can add a unit_test stage, followed by an integration_test stage, ensuring your code base remains stable before deployment.
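
As a concrete illustration of the unit-test level, the sketch below tests one small feature function with Python's built-in unittest module. The function `price_per_sqm` is a hypothetical example invented for this sketch; it is not taken from the project code.

```python
import unittest


def price_per_sqm(resale_price: float, floor_area_sqm: float) -> float:
    """Hypothetical feature: resale price divided by floor area."""
    if floor_area_sqm <= 0:
        raise ValueError("floor_area_sqm must be positive")
    return resale_price / floor_area_sqm


class PricePerSqmTest(unittest.TestCase):
    def test_typical_flat(self):
        self.assertAlmostEqual(price_per_sqm(500_000, 100), 5_000.0)

    def test_rejects_non_positive_area(self):
        with self.assertRaises(ValueError):
            price_per_sqm(500_000, 0)
```

A unit_test job could run such tests with `python -m unittest` before any build or deployment stage, so a broken feature function stops the pipeline early.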

5F.6. Common Considerations and Best Practices

Access Control
Restrict who can approve dev, uat, or prod deployments. This ensures that only authorized team members can promote changes into higher environments.
Rollback Strategy
Maintain versioned artifacts. If a newly deployed model underperforms or fails, you can redeploy an earlier stable model or use a “blue/green” deployment strategy.
Testing
Incorporate unit, integration, and load tests before or after deployment to ensure code quality and endpoint stability.
Notifications
Configure Slack or email alerts for pipeline events (e.g., build failures, approvals needed).
Security
Use CI/CD Variables to store AWS credentials and other sensitive data, and mark them as masked (and protected where appropriate). Avoid printing secrets in logs or storing them in repository files.
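
GitLab hides masked variables in job logs, but scripts that build their own log or report files may still need to redact values themselves. The following is a minimal sketch of that idea; `redact` is a hypothetical helper and the "[MASKED]" placeholder is an arbitrary choice.

```python
from typing import Iterable


def redact(text: str, secrets: Iterable[str], mask: str = "[MASKED]") -> str:
    """Replace every occurrence of each non-empty secret in `text` with `mask`."""
    # Replace longer secrets first so a secret that contains another is fully masked.
    for secret in sorted((s for s in secrets if s), key=len, reverse=True):
        text = text.replace(secret, mask)
    return text


if __name__ == "__main__":
    # Placeholder value for illustration, not a real credential.
    print(redact("token=abc123 sent", ["abc123"]))
```

Redaction of this kind is a safety net only; the more reliable practice is not to write secret values into messages in the first place.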

5F.7. Summary

With this .gitlab-ci.yml setup, you can see how each environment—dev, uat, and prod—receives its own manual approval gate, deployment step, and endpoint validation phase. The granular stages (e.g., check_model, deploy_pipeline) also enable flexible control over each part of the build-and-deploy sequence. This approach not only provides a streamlined, automated path for your model to move from development to production but also ensures there are auditable checkpoints at each step of the way.

In cases of failure, logs and error messages in the Build > Pipelines section, SageMaker AI Studio, or CloudWatch can help you pinpoint the root cause.

Testing—though not the main focus here—can also be seamlessly integrated into the pipeline at multiple levels (unit, integration, system testing) to maintain a robust code base. With a successful pipeline in place, you’ll have greater confidence in your build artifacts, ready to move on to more advanced stages like model deployment, monitoring, and updating.

As you refine your pipeline, consider advanced features like:

- Canary or Blue/Green deployments for safer rollouts.
- Automated load testing or advanced monitoring steps in dev/uat before production.
- Policy-based approvals to ensure the right stakeholders sign off on each environment promotion.

By adopting these practices, your organization will be well-equipped to handle the full lifecycle of machine learning models in a scalable, reliable, and secure manner.