Chapter 5F: Build & Deploy
In this chapter, we’ll walk through configuring a CI/CD pipeline on SHIP-HATS GitLab to manage your deployment workflow (from model build to endpoint creation), along with the steps to troubleshoot build failures. Although testing (unit, integration, system) is not a primary focus here, the chapter also briefly describes where each type of test fits into the pipeline. Keep in mind that the specifics of your workflow will depend on your setup and your preferences for handling development and deployment; there’s no one-size-fits-all approach.
By the end of this chapter, you’ll be able to configure a SHIP-HATS GitLab pipeline that promotes models through different environments based on branch merges—ensuring a consistent, traceable, and auditable deployment process.
5F.1. Overview of the Development to Deployment Flow
- Feature Branch Creation
  A new feature branch is created from the main branch. Examples of development work performed on this branch include model updates, model re-training, and feature building. Once preliminary development work is complete, the branch is merged into the dev branch.
- Commit to dev Branch
  The dev branch focuses on model building. A commit to this branch triggers an automated build pipeline that spins up a runner, checks out the repository, and runs the build commands specified in your .gitlab-ci.yml, including model training, validation checks, and artifact generation. After manual approval, the model is deployed to the dev environment, and its API endpoint is generated and validated.
- Merge dev Branch into uat Branch
  Merging into uat triggers a testing-only pipeline (no rebuilding) covering model validation, A/B testing, and performance checks. The existing model artifacts are deployed to the uat environment, and the API endpoint is generated and validated.
- Merge uat Branch into main Branch
  The main branch handles deployment only (no rebuilding). Merging into main triggers the deployment pipeline for the production release. After final approval, the validated model is deployed to production, and a final endpoint validation confirms that the deployment succeeded.
- API Endpoint Generation and Testing
  Each deployment step runs an endpoint generation script followed by tests that verify connectivity and responses, without rebuilding the model artifacts.
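The branch-to-environment flow above can be summarised as a small lookup table. The following is a hypothetical Python helper (not part of GitLab or SageMaker); its data simply mirrors the rules: blocks of the example .gitlab-ci.yml in 5F.3, where only dev runs deploy_pipeline (the rebuild) and all three long-lived branches run the check, approval, deployment, and validation jobs.

```python
# Hypothetical helper: the tables below mirror the rules: blocks of the example
# .gitlab-ci.yml by hand; nothing here is read from GitLab.
from typing import List

# Job names in stage order, as declared under "stages:".
STAGE_ORDER = [
    "deploy_pipeline",
    "check_model",
    "approve_model",
    "deploy_model",
    "validate_endpoint",
]

# Branches matched by each job's rules: block. Only dev rebuilds the model.
JOB_BRANCHES = {
    "deploy_pipeline": {"dev"},
    "check_model": {"dev", "uat", "main"},
    "approve_model": {"dev", "uat", "main"},
    "deploy_model": {"dev", "uat", "main"},
    "validate_endpoint": {"dev", "uat", "main"},
}

# Model package approval status that check_model searches for on each branch.
STATUS_TO_CHECK = {
    "dev": "PendingManualApproval",
    "uat": "Approved",
    "main": "Approved",
}


def jobs_for_branch(branch: str) -> List[str]:
    """Jobs that would run for a push to `branch`, in stage order."""
    return [job for job in STAGE_ORDER if branch in JOB_BRANCHES[job]]


def rebuilds_model(branch: str) -> bool:
    """True only where deploy_pipeline (model training and registration) runs."""
    return "deploy_pipeline" in jobs_for_branch(branch)


if __name__ == "__main__":
    for branch in ("dev", "uat", "main", "feature/new-features"):
        print(
            f"{branch}: jobs={jobs_for_branch(branch)}, "
            f"rebuild={rebuilds_model(branch)}, "
            f"checks={STATUS_TO_CHECK.get(branch, 'n/a')}"
        )
```

Feature branches match none of the rules in the example, so pushing to them runs no jobs until they are merged into dev.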
5F.2. Prerequisites
- GitLab Repository Setup
  You have a SHIP-HATS GitLab repository with the following branches: dev, uat, and main.
- GitLab Runners
  A GitLab runner (shared or self-hosted) must be configured to execute your CI/CD jobs.
- Credentials and Environment Variables
  Store environment-specific credentials and settings (e.g., AWS credentials or role ARNs, Docker credentials) as secure CI/CD Variables in GitLab. The example in 5F.3 authenticates to AWS with a GitLab OIDC token, so it needs an IAM role ARN and a region rather than long-lived AWS access keys.
- Dev, UAT, and Production Environments
  Ensure each environment (dev, uat, production) can host or serve your model, for example on AWS SageMaker.
5F.3. Setting Up a CI/CD Pipeline on SHIP-HATS GitLab
1. Configure your .gitlab-ci.yml File
Below is an example of how your .gitlab-ci.yml file could look. It defines five stages (deploy_pipeline, check_model, approve_model, deploy_model, and validate_endpoint) that cover running the SageMaker model-building pipeline, checking for a registered model package, manual approval, model deployment, and endpoint validation for the dev, uat, and production (main) environments. Each stage can be adapted to your specific workflow, such as creating or updating an AWS SageMaker endpoint.
variables:
  MODEL_PACKAGE_GROUP_NAME: "ResalePriceProjectPackageGroup"
  INSTANCE_TYPE: "ml.m5.large"
  INSTANCE_COUNT: "1"
  MAX_WAIT_TIME: "900"
  POLL_INTERVAL: "30"

stages:
  - deploy_pipeline
  - check_model
  - approve_model
  - deploy_model
  - validate_endpoint

.default-setup: &default-setup
  image: nexus-docker.ship.gov.sg/python:3.8-slim
  tags: [ship_docker]
  id_tokens:
    AWS_OIDC_TOKEN:
      aud: https://sgts.gitlab-dedicated.com
  before_script:
    - apt-get update && apt-get install -y curl unzip jq
    - pip install --upgrade pip
    - pip install boto3 sagemaker awscli
    - mkdir -p ~/.aws
    - echo "${AWS_OIDC_TOKEN}" > /tmp/web_identity_token.txt
    - echo -e "[profile oidc]\nrole_arn=${AWS_ROLE_ARN}\nweb_identity_token_file=/tmp/web_identity_token.txt" > ~/.aws/config
    - export AWS_PROFILE=oidc
    - export AWS_DEFAULT_REGION=$AWS_REGION
    - aws sts get-caller-identity

.deploy_model_template: &deploy_model_template
  <<: *default-setup
  script:
    - source model_metadata.env
    - |
      if [ "$MODEL_FOUND" == "false" ]; then
        echo "No approved model. Skipping deployment."
        exit 0
      fi
      MODEL_PACKAGE_ARN=$(aws sagemaker list-model-packages \
        --model-package-group-name "$MODEL_PACKAGE_GROUP_NAME" \
        --query "reverse(sort_by(ModelPackageSummaryList[?ModelApprovalStatus=='Approved'], &CreationTime))[0].ModelPackageArn" \
        --output text 2>/dev/null || echo "")
      if [ -z "$MODEL_PACKAGE_ARN" ] || [ "$MODEL_PACKAGE_ARN" == "None" ]; then
        echo "No approved model found."
        exit 1
      fi
      MODEL_NAME="resale-price-model-${CI_ENVIRONMENT_NAME}-${CI_COMMIT_SHORT_SHA}"
      CONFIG_NAME="resale-price-config-${CI_ENVIRONMENT_NAME}-${CI_COMMIT_SHORT_SHA}"
      ENDPOINT_NAME="resale-price-endpoint-${CI_ENVIRONMENT_NAME}"
      aws sagemaker create-model \
        --model-name "$MODEL_NAME" \
        --containers "[{\"ModelPackageName\": \"$MODEL_PACKAGE_ARN\"}]" \
        --execution-role-arn "$AWS_ROLE_ARN"
      aws sagemaker create-endpoint-config \
        --endpoint-config-name "$CONFIG_NAME" \
        --production-variants "[{\"VariantName\": \"AllTraffic\", \"ModelName\": \"$MODEL_NAME\", \"InstanceType\": \"$INSTANCE_TYPE\", \"InitialInstanceCount\": $INSTANCE_COUNT}]"
      ENDPOINT_STATUS=$(aws sagemaker describe-endpoint --endpoint-name "$ENDPOINT_NAME" --query "EndpointStatus" --output text 2>/dev/null || echo "NOT_FOUND")
      if [ "$ENDPOINT_STATUS" == "InService" ] || [ "$ENDPOINT_STATUS" == "Updating" ]; then
        aws sagemaker update-endpoint --endpoint-name "$ENDPOINT_NAME" --endpoint-config-name "$CONFIG_NAME"
      else
        aws sagemaker create-endpoint --endpoint-name "$ENDPOINT_NAME" --endpoint-config-name "$CONFIG_NAME"
      fi

.validate_endpoint_template: &validate_endpoint_template
  <<: *default-setup
  script:
    - source model_metadata.env
    - |
      ENDPOINT_NAME="resale-price-endpoint-${CI_ENVIRONMENT_NAME}"
      TIME_ELAPSED=0
      while true; do
        STATUS=$(aws sagemaker describe-endpoint --endpoint-name "$ENDPOINT_NAME" --query "EndpointStatus" --output text 2>/dev/null || echo "NOT_FOUND")
        if [ "$STATUS" == "InService" ]; then
          echo "Endpoint is now in service!"
          break
        fi
        if [ "$STATUS" == "Failed" ] || [ "$TIME_ELAPSED" -ge "$MAX_WAIT_TIME" ]; then
          echo "Endpoint failed or timed out."
          exit 1
        fi
        echo "Waiting... Status: $STATUS"
        sleep $POLL_INTERVAL
        TIME_ELAPSED=$((TIME_ELAPSED + POLL_INTERVAL))
      done

deploy_pipeline:
  <<: *default-setup
  stage: deploy_pipeline
  script:
    - python -m pipelines.get_pipeline_definition -n model_code.pipeline -f pipeline.json --region $AWS_REGION
    - python -m pipelines.run_pipeline -n model_code.pipeline -role-arn "$AWS_ROLE_ARN"
  environment: $CI_COMMIT_BRANCH
  rules:
    - if: '$CI_COMMIT_BRANCH == "dev"'

check_model:
  <<: *default-setup
  stage: check_model
  script:
    - |
      if [ "$CI_COMMIT_BRANCH" == "dev" ]; then
        MODEL_PACKAGE_ARN=$(aws sagemaker list-model-packages \
          --model-package-group-name "$MODEL_PACKAGE_GROUP_NAME" \
          --query "reverse(sort_by(ModelPackageSummaryList[?ModelApprovalStatus=='PendingManualApproval'], &CreationTime))[0].ModelPackageArn" \
          --output text 2>/dev/null || echo "")
      else
        MODEL_PACKAGE_ARN=$(aws sagemaker list-model-packages \
          --model-package-group-name "$MODEL_PACKAGE_GROUP_NAME" \
          --query "reverse(sort_by(ModelPackageSummaryList[?ModelApprovalStatus=='Approved'], &CreationTime))[0].ModelPackageArn" \
          --output text 2>/dev/null || echo "")
      fi
      if [ -z "$MODEL_PACKAGE_ARN" ] || [ "$MODEL_PACKAGE_ARN" == "None" ]; then
        echo "MODEL_FOUND=false" > model_metadata.env
      else
        echo "MODEL_FOUND=true" > model_metadata.env
        echo "MODEL_PACKAGE_ARN=$MODEL_PACKAGE_ARN" >> model_metadata.env
      fi
  artifacts:
    paths: [model_metadata.env]
  environment: $CI_COMMIT_BRANCH
  rules:
    - if: '$CI_COMMIT_BRANCH == "dev"'
    - if: '$CI_COMMIT_BRANCH == "uat"'
    - if: '$CI_COMMIT_BRANCH == "main"'

approve_model:
  <<: *default-setup
  stage: approve_model
  script:
    - source model_metadata.env
    - |
      if [ "$CI_COMMIT_BRANCH" == "dev" ]; then
        if [ "$MODEL_FOUND" == "true" ]; then
          aws sagemaker update-model-package \
            --model-package-arn "$MODEL_PACKAGE_ARN" \
            --model-approval-status "Approved"
        else
          echo "No pending model found for manual approval in $CI_COMMIT_BRANCH." >&2
          exit 1
        fi
      else
        if [ "$MODEL_FOUND" == "true" ]; then
          echo "Using approved model for $CI_COMMIT_BRANCH: $MODEL_PACKAGE_ARN"
        else
          echo "No approved model found for deployment in $CI_COMMIT_BRANCH." >&2
          exit 1
        fi
      fi
  artifacts:
    paths: [model_metadata.env]
  when: manual
  allow_failure: false
  environment: $CI_COMMIT_BRANCH
  rules:
    - if: '$CI_COMMIT_BRANCH == "dev"'
    - if: '$CI_COMMIT_BRANCH == "uat"'
    - if: '$CI_COMMIT_BRANCH == "main"'

deploy_model:
  <<: *deploy_model_template
  stage: deploy_model
  dependencies: [approve_model]
  artifacts:
    paths: [model_metadata.env]
  environment: $CI_COMMIT_BRANCH
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'
    - if: '$CI_COMMIT_BRANCH == "uat"'
    - if: '$CI_COMMIT_BRANCH == "dev"'
      when: on_success

validate_endpoint:
  <<: *validate_endpoint_template
  stage: validate_endpoint
  dependencies: [deploy_model]
  environment: $CI_COMMIT_BRANCH
  rules:
    - if: '$CI_COMMIT_BRANCH == "dev"'
    - if: '$CI_COMMIT_BRANCH == "uat"'
    - if: '$CI_COMMIT_BRANCH == "main"'
Explanation of Key Sections
- Variables Section: Sets reusable defaults for model package group name, instance type/count, and polling timeouts.
- Stages Section: Defines the CI/CD workflow phases: pipeline deployment, model check, manual approval, model deployment, and endpoint validation.
- default-setup: Common job settings, including the Docker image, AWS OIDC authentication, and installation of the AWS CLI, Boto3, and the SageMaker SDK.
- deploy_model_template: Shared script logic that skips deployment if no approved model exists; otherwise it creates a SageMaker model and endpoint configuration from the latest approved model package, then updates the endpoint if it already exists or creates it if it does not.
- validate_endpoint_template: Shared polling loop that waits for the SageMaker endpoint to reach “InService”, failing the job if the endpoint enters “Failed” or MAX_WAIT_TIME elapses.
- deploy_pipeline: Runs Python commands that generate the SageMaker pipeline definition and then create or update and start the pipeline; scoped to the dev branch.
- check_model: Queries SageMaker for the most recent pending (on dev) or approved (on uat/main) model package and writes MODEL_FOUND and MODEL_PACKAGE_ARN to model_metadata.env, which is passed to later jobs as an artifact.
- approve_model: Manual gate that approves the pending model on dev, or confirms that an approved model exists on uat/main; it re-publishes model_metadata.env as an artifact for later jobs.
- deploy_model: Deploys the approved model using the deployment template, conditional on successful approval.
- validate_endpoint: Runs the validation template to ensure the deployed endpoint is active before the pipeline finishes.
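If it helps to see check_model’s selection rule outside of shell and JMESPath, here is a minimal Python sketch of the same logic. It assumes the input has the shape of the ModelPackageSummaryList returned by SageMaker’s ListModelPackages API; the function names and the placeholder ARNs are illustrative only and do not appear in the example configuration.

```python
# Sketch of check_model's selection logic, assuming `summaries` has the shape of
# ModelPackageSummaryList from SageMaker's ListModelPackages API.
# Function names and sample ARNs are hypothetical.
from datetime import datetime
from typing import Dict, List, Optional


def latest_package_arn(summaries: List[Dict], status: str) -> Optional[str]:
    """Newest package ARN with the given approval status, or None if there is none.

    Mirrors the JMESPath query used in the example:
    reverse(sort_by(ModelPackageSummaryList[?ModelApprovalStatus=='<status>'],
    &CreationTime))[0].ModelPackageArn
    """
    matching = [s for s in summaries if s.get("ModelApprovalStatus") == status]
    if not matching:
        return None
    return max(matching, key=lambda s: s["CreationTime"])["ModelPackageArn"]


def metadata_env(arn: Optional[str]) -> str:
    """The KEY=value lines that check_model writes to model_metadata.env."""
    if arn is None:
        return "MODEL_FOUND=false\n"
    return f"MODEL_FOUND=true\nMODEL_PACKAGE_ARN={arn}\n"


if __name__ == "__main__":
    # Placeholder data; in CI this list could come from
    # boto3.client("sagemaker").list_model_packages(ModelPackageGroupName=...)
    sample = [
        {"ModelPackageArn": "example-arn-v1", "ModelApprovalStatus": "Approved",
         "CreationTime": datetime(2024, 1, 10)},
        {"ModelPackageArn": "example-arn-v2", "ModelApprovalStatus": "PendingManualApproval",
         "CreationTime": datetime(2024, 2, 5)},
    ]
    print(metadata_env(latest_package_arn(sample, "PendingManualApproval")), end="")
```

If you feed this from boto3 rather than the CLI, note that list_model_packages returns results page by page (follow NextToken), so collect every page before selecting the newest package.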
2. Verify Pipeline Settings
- In your SHIP-HATS GitLab project, go to Settings > CI/CD to confirm that:
  - Shared Runners or Specific Runners are available.
  - CI/CD Variables are securely stored. Variables you may want to include are AWS_ACCOUNT_ID, AWS_REGION, AWS_ROLE_ARN, and S3_BUCKET_NAME; the example .gitlab-ci.yml reads AWS_REGION and AWS_ROLE_ARN directly.
- (Optional) Protect your dev branch if you want to restrict who can push or merge code into it.
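A missing variable usually surfaces later as a confusing AWS CLI error. One option is a small pre-flight check run at the start of a job. The sketch below is hypothetical (not part of the example pipeline); the variable names are the ones suggested above, split into those the example reads directly and those that are merely recommended. It reports only which names are missing and never prints their values.

```python
# Hypothetical pre-flight check for CI/CD variables; adjust the names to your project.
import os
import sys
from typing import Iterable, List, Mapping

# Read directly by the example .gitlab-ci.yml.
REQUIRED_VARS = ("AWS_REGION", "AWS_ROLE_ARN")
# Suggested in this chapter but not used by the example configuration.
RECOMMENDED_VARS = ("AWS_ACCOUNT_ID", "S3_BUCKET_NAME")


def missing(names: Iterable[str], env: Mapping[str, str]) -> List[str]:
    """Return the names that are unset or empty in `env`."""
    return [name for name in names if not env.get(name)]


def main(env: Mapping[str, str] = os.environ) -> int:
    """Return 0 if all required variables are set, 1 otherwise."""
    for name in missing(RECOMMENDED_VARS, env):
        print(f"warning: {name} is not set", file=sys.stderr)
    required = missing(REQUIRED_VARS, env)
    if required:
        print(f"error: missing required CI/CD variables: {', '.join(required)}", file=sys.stderr)
        return 1
    # Confirm presence only; never echo secret values into job logs.
    print("All required CI/CD variables are set.")
    return 0
```

Saved as, for example, check_ci_vars.py (a hypothetical file name), it could be called from a before_script line such as python -c "import sys, check_ci_vars; sys.exit(check_ci_vars.main())" so the job fails fast.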
3. Perform First Commit to the dev Branch
With your .gitlab-ci.yml file (and, optionally, a local development environment) in place, you can verify that the CI/CD pipeline is set up correctly as follows:
- Make a Small Change (e.g., edit the README.md file).
- Stage and Commit
  Stage and commit the change, then push it to the dev branch (for example, git add README.md, git commit -m "Test CI/CD pipeline", then git push origin dev).
- Check GitLab Pipeline
  - In your SHIP-HATS GitLab project, go to CI/CD > Build > Pipelines.
  - Observe the pipeline for the dev branch. If configured correctly, you’ll see the deploy_pipeline job triggered.
  - When the pipeline succeeds, a green check mark appears next to the pipeline run in the pipeline list.
Tip: You can also configure GitLab Notifications (Slack, email) to get immediate alerts when a build fails or succeeds.
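The validate_endpoint job only waits for the endpoint to reach InService. After your first dev pipeline succeeds, you may also want to confirm that the endpoint actually answers requests, which is the connectivity-and-response check described in 5F.1. The sketch below assumes, without confirmation from this chapter, that the model accepts one CSV row of features and returns a single positive number (a resale price); the payload and the positivity check are placeholders to replace with your model’s real contract.

```python
# Minimal post-deployment smoke test (a sketch). Assumptions: CSV input, a single
# numeric prediction as output; pass boto3.client("sagemaker-runtime") as `runtime`.
from typing import Any


def smoke_test(runtime: Any, endpoint_name: str, payload: str) -> float:
    """Invoke the endpoint once and check that it returns a plausible number."""
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",  # assumed input format
        Body=payload,
    )
    body = response["Body"].read().decode("utf-8").strip()
    prediction = float(body)  # raises ValueError if the response is not numeric
    if prediction <= 0:
        raise AssertionError(f"Unexpected non-positive prediction: {prediction}")
    return prediction
```

In CI the endpoint name would follow the example’s convention, resale-price-endpoint-${CI_ENVIRONMENT_NAME}, and the payload should be a known-good feature row from your own test data.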
5F.4. Debugging Failed Builds
If your build fails, check the following places for the error messages you need for debugging:
- GitLab Pipeline Logs
  - Go to the CI/CD > Build > Pipelines section in your SHIP-HATS GitLab project.
  - Locate the failed pipeline run, open its Jobs tab, and select the failed job to view its detailed log.
  - Look for error messages or tracebacks in the console output.
- SageMaker AI Studio
  - If you are using AWS SageMaker Pipelines integrated with your GitLab pipeline, open SageMaker AI Studio.
  - Navigate to the Pipelines tab to find the corresponding run.
  - Detailed error messages (e.g., from Docker builds or training jobs) may appear here.
- CloudWatch Logs
  - If your build runs tasks on AWS infrastructure, relevant logs may be available in Amazon CloudWatch.
  - Check for error messages related to insufficient permissions, resource limits, or misconfiguration.
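When deploy_pipeline fails because a SageMaker pipeline step failed, the same information shown in SageMaker AI Studio can be pulled programmatically, for example to print it into the GitLab job log. The sketch below uses the SageMaker ListPipelineExecutions and ListPipelineExecutionSteps APIs through a boto3 client passed in as sm; the pipeline name is hypothetical, and only the first page of each response is read.

```python
# Sketch: report failed steps of the most recent SageMaker pipeline execution.
# Pass boto3.client("sagemaker") as `sm`; the pipeline name is project-specific.
from typing import Any, Dict, List, Tuple


def failed_steps(steps_response: Dict) -> List[Tuple[str, str]]:
    """(step name, failure reason) for every step whose status is Failed."""
    return [
        (step["StepName"], step.get("FailureReason", "no FailureReason reported"))
        for step in steps_response.get("PipelineExecutionSteps", [])
        if step.get("StepStatus") == "Failed"
    ]


def latest_failures(sm: Any, pipeline_name: str) -> List[Tuple[str, str]]:
    """Find the newest execution of `pipeline_name` and return its failed steps."""
    executions = sm.list_pipeline_executions(
        PipelineName=pipeline_name,
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )["PipelineExecutionSummaries"]
    if not executions:
        return []
    arn = executions[0]["PipelineExecutionArn"]
    return failed_steps(sm.list_pipeline_execution_steps(PipelineExecutionArn=arn))
```

The FailureReason text is usually enough to tell whether to look next at the training job’s CloudWatch logs or at IAM permissions.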
5F.5. Brief Overview of Testing Types
While testing is not the main focus of this guide, it’s important to note that different tests can be integrated at multiple stages of your CI/CD pipeline:
- Unit Tests
  - Focus on small, isolated chunks of code (e.g., individual functions or classes).
  - Quick to run and easy to maintain.
- Integration Tests
  - Validate that different components (e.g., data processing, model training) work together correctly.
  - Use real or mocked dependencies.
- System Tests
  - Test the entire pipeline end-to-end, including external systems like databases, message queues, or third-party APIs.
  - More complex and potentially time-consuming.
Depending on your project’s complexity, you can run these tests as separate .gitlab-ci.yml jobs. For instance, after a build stage you can add a unit_test stage followed by an integration_test stage, so that your codebase is verified before deployment.
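As an illustration of the unit-test level, here is a small test for a hypothetical feature-building function. price_per_sqm is not part of this chapter’s code; it stands in for any small, pure function in your model code. A unit_test job could run such files with python -m unittest discover (or pytest) before deploy_pipeline.

```python
# Illustrative unit test; price_per_sqm is a hypothetical feature function.
import unittest


def price_per_sqm(resale_price: float, floor_area_sqm: float) -> float:
    """Hypothetical feature: resale price normalised by floor area."""
    if floor_area_sqm <= 0:
        raise ValueError("floor_area_sqm must be positive")
    return resale_price / floor_area_sqm


class PricePerSqmTest(unittest.TestCase):
    def test_basic_ratio(self):
        self.assertAlmostEqual(price_per_sqm(500_000, 100), 5_000.0)

    def test_rejects_non_positive_area(self):
        with self.assertRaises(ValueError):
            price_per_sqm(500_000, 0)
```

Because the function has no external dependencies, the test runs in milliseconds; integration tests would instead exercise, say, the preprocessing and training steps together.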
5F.6. Common Considerations and Best Practices
- Access Control
  Restrict who can approve dev, uat, or prod deployments. This ensures that only authorized team members can push changes into higher environments.
- Rollback Strategy
  Maintain versioned artifacts. If a newly deployed model underperforms or fails, you can redeploy an earlier stable model or use a “blue/green” deployment strategy.
- Testing
  Incorporate unit, integration, and load tests before or after deployment to ensure code quality and endpoint stability.
- Notifications
  Configure Slack or email alerts for pipeline events (e.g., build failures, approvals needed).
- Security
  Use CI/CD Variables to store AWS credentials and other sensitive data. Avoid printing secrets in logs or storing them in repository files.
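In the example configuration, deploy_model always deploys the newest Approved package in the model package group. One way to roll back under that design is therefore to mark the faulty package as Rejected, so the same query falls back to the previous approved version. The sketch below plans such a rollback; the function name is hypothetical, and it assumes the ModelPackageSummaryList shape from ListModelPackages. It refuses to proceed if no other approved package exists.

```python
# Sketch: plan a rollback by rejecting the faulty model package so that the
# example's "newest Approved package" query falls back to the previous one.
from typing import Dict, List, Tuple


def plan_rollback(summaries: List[Dict], bad_arn: str, reason: str) -> Tuple[Dict, str]:
    """Return (kwargs for update_model_package, ARN deploy_model would pick next).

    Raises ValueError if no other Approved package exists, so a rollback never
    leaves the group without a deployable model.
    """
    remaining = [
        s for s in summaries
        if s.get("ModelApprovalStatus") == "Approved" and s["ModelPackageArn"] != bad_arn
    ]
    if not remaining:
        raise ValueError("No earlier Approved model package to fall back to")
    fallback = max(remaining, key=lambda s: s["CreationTime"])["ModelPackageArn"]
    reject_request = {
        "ModelPackageArn": bad_arn,
        "ModelApprovalStatus": "Rejected",
        "ApprovalDescription": reason,
    }
    return reject_request, fallback
```

After reviewing the plan, you would apply it with boto3.client("sagemaker").update_model_package(**reject_request) and then redeploy. Note that the example names models and endpoint configurations after CI_COMMIT_SHORT_SHA, so redeploying from the same commit may collide with resources that already exist; adding CI_PIPELINE_ID to those names is one possible workaround.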
5F.7. Summary
With this .gitlab-ci.yml setup, you can see how each environment—dev, uat, and prod—receives its own manual approval gate, deployment step, and endpoint validation phase. The granular stages (e.g., check_model, deploy_pipeline) also enable flexible control over each part of the build-and-deploy sequence. This approach not only provides a streamlined, automated path for your model to move from development to production but also ensures there are auditable checkpoints at each step of the way.
In cases of failure, logs and error messages in the CI/CD > Pipelines section, SageMaker AI Studio, or CloudWatch can help you pinpoint the root cause.
Testing—though not the main focus here—can also be seamlessly integrated into the pipeline at multiple levels (unit, integration, system testing) to maintain a robust code base. With a successful pipeline in place, you’ll have greater confidence in your build artifacts, ready to move on to more advanced stages like model deployment, monitoring, and updating.
As you refine your pipeline, consider advanced features like:
- Canary or Blue/Green deployments for safer rollouts.
- Automated load testing or advanced monitoring steps in dev/uat before production.
- Policy-based approvals to ensure the right stakeholders sign off on each environment promotion.
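For canary rollouts on SageMaker, the UpdateEndpoint API accepts a DeploymentConfig with a blue/green update policy. The sketch below builds that structure for a canary that first shifts a percentage of capacity, waits, then shifts the rest, optionally rolling back automatically if named CloudWatch alarms fire. The percentages, wait times, and alarm names are placeholders to tune, and SageMaker applies its own limits to these values.

```python
# Sketch: canary DeploymentConfig for SageMaker UpdateEndpoint (blue/green policy).
# All numbers and alarm names are placeholders.
from typing import Any, Dict, List, Optional


def canary_deployment_config(
    canary_percent: int = 10,
    wait_seconds: int = 600,
    alarm_names: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Shift canary_percent of capacity first, wait wait_seconds, then shift the rest."""
    if not 0 < canary_percent < 100:
        raise ValueError("canary_percent must be between 1 and 99")
    config: Dict[str, Any] = {
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": {
                "Type": "CANARY",
                "CanarySize": {"Type": "CAPACITY_PERCENT", "Value": canary_percent},
                "WaitIntervalInSeconds": wait_seconds,
            },
            # Keep the old (blue) fleet for a while after full cutover.
            "TerminationWaitInSeconds": 300,
        }
    }
    if alarm_names:
        # Roll back automatically if any of these CloudWatch alarms fires.
        config["AutoRollbackConfiguration"] = {
            "Alarms": [{"AlarmName": name} for name in alarm_names]
        }
    return config
```

This would replace the plain update call in deploy_model_template with something like boto3.client("sagemaker").update_endpoint(EndpointName=..., EndpointConfigName=..., DeploymentConfig=canary_deployment_config(alarm_names=["resale-price-5xx"])), where the alarm name is hypothetical.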
By adopting these practices, your organization will be well-equipped to handle the full lifecycle of machine learning models in a scalable, reliable, and secure manner.