Chapter 4: Reference Guide for Stage 2 MLOps Setup

Some of the resources linked in this chapter are not openly accessible at this time. For access to these scripts, please contact Victor Ong (Victor_ONG@tech.gov.sg) or Richard Yan (Richard_YAN@tech.gov.sg).

In this chapter, we present an MLOps framework that outlines the essential components for establishing a robust MLOps environment—from infrastructure setup to security considerations. The framework highlights key tools and best practices necessary for effective MLOps implementation.

We begin by examining how your agency can progress to Stage 2 MLOps and beyond. Reaching this level typically demands more advanced capabilities, which vary based on your agency’s specific operational needs and priorities. We address practical challenges such as team readiness, choosing between vendor solutions and in-house development, cloud versus on-premise deployment, and other MLOps dimensions that may be unique to your context. Sections 4.1 to 4.3 focus on these foundational considerations.

From Section 4.4 onward, the emphasis shifts to tooling and architectural patterns that enable scalable, secure, and efficient MLOps environments.

4.1. Before You Begin: Vendor vs In-House vs Hybrid

Vendor-Managed

This is a hands-off model, which can pose challenges for MLOps. Since model development is managed by the vendor, opportunities for continuous experimentation and iteration are limited.

It is best suited for use cases where minimal experimentation is expected and your agency is prepared to manage the risks and complexities of transitioning between vendors when necessary.

In this arrangement, consider using this playbook to explore what an MLOps setup might look like, using either mock or production data. Try developing a proof of concept (POC) in-house, and use the insights gained to draft a blueprint that can guide your tender specifications. Keep in mind that maintenance requirements may be difficult to change later on, so plan carefully. Consult the AI Practice if you need more guidance.

Hybrid

In this setup, the default is for the vendor to manage the production environment, while your in-house team manages development or both development and UAT environments.

This approach supports ongoing development work (which is crucial!) even after deployment. You'll be primarily responsible for development and testing, while guiding the vendor on how to maintain and monitor the model in production. You'll also need to align the DevOps setup across environments.

For development work, agencies can choose between using an in-house GCC development platform, a vendor-provided platform, or MAESTRO. Be sure to explore the pros and cons of each option. Consult with the AI Practice—we can recommend and support the best hybrid solutions.

Fully In-House

This means your team handles everything: building, deployment, testing, security, monitoring, and maintenance.

It's ideal for more mature teams that want full control and have the capability to manage an MLOps system end-to-end. It also suits teams that already have a strong software engineering setup and just need to integrate MLOps principles into existing workflows.

Moving from experimentation to production is also much easier than with vendor-based models. However, be prepared to hire specialists and engineers to help with day-2 operations of the system.

4.2. Before You Begin: Cloud vs On-Prem vs Hybrid

For cloud solutions, the typical choices are AWS, Azure, or GCP. Cloud solutions are typically cost-effective, scalable, and secure. This guide focuses mainly on the AWS setup; an Azure setup is also available in GitLab (see Azure Gitlab).

Hybrid and on-premise solutions tend to be more complex, often requiring upfront resource provisioning. For example, an agency might store data on-premise while running machine learning models in the cloud.

As of June 2025, this Playbook does not yet cover hybrid or on-premise solutions in detail. If you have questions about these setups, or any other specific configurations, please don’t hesitate to reach out to the AI Practice team.

4.3. Before You Begin: Agency’s Context and Priorities

As you read through this content, you may find that the solutions presented are not fully aligned with your agency’s specific needs, or you may still be in the process of identifying those needs. For this purpose, we’ve developed a 10-category capability framework (see Chapter 7: Ten Category Maturity Framework) to help assess both your current MLOps maturity and your envisioned future state, enabling a more strategic and tailored approach to implementation.

4.4. Concept

This chapter offers structured baseline recommendations for your MLOps setup. Users are encouraged to adapt the setup to fit their agency's specific needs and infrastructure.

Here, we establish a structured approach to MLOps that integrates GitLab CI/CD, AWS SageMaker, and Terraform to streamline machine learning operations. The chapter provides a step-by-step guide to implementing scalable, automated, and secure MLOps workflows using cloud-native services.

While cloud-agnostic concepts are included, implementation examples focus on AWS services (SageMaker, S3, Lambda). The methodologies can be adapted to other cloud providers.

This chapter follows these key principles:
- Reproducibility: Ensuring consistent results across environments and deployments.
- Automation: Reducing manual effort through CI/CD pipelines and infrastructure as code.
- Security: Embedding security best practices across the MLOps lifecycle.
- Scalability: Enabling seamless scaling of ML workloads.
- Cloud Flexibility: Supporting AWS-native implementations while providing guidance on adapting workflows to other cloud environments.

4.5. Get Started

Prerequisites

GCC AWS account with necessary permissions.
Terraform (>=1.3) installed.
GitLab account with CI/CD access.
Docker installed for containerized builds, if you're using containers instead of pre-built models provided by your platform (like SageMaker).
Python (>=3.8) and ML dependencies installed.
GitLab Runners with appropriate AWS IAM permissions.
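
Before provisioning anything, it can help to confirm that a workstation or runner meets the version floors listed above. The following is a minimal sketch using only the Python standard library; the helper names (`parse_version`, `meets_minimum`, `missing_tools`) and the choice of tools to look for are illustrative assumptions, not part of the playbook's scripts.

```python
import re
import shutil

# Minimum versions taken from the prerequisites list above.
MINIMUM_VERSIONS = {"terraform": (1, 3), "python": (3, 8)}


def parse_version(text):
    """Extract the first dotted version number, e.g. 'Terraform v1.5.7' -> (1, 5, 7)."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(version_text, minimum):
    """Return True when the parsed version is at least `minimum`."""
    return parse_version(version_text) >= tuple(minimum)


def missing_tools(tools=("terraform", "docker", "git")):
    """List the command-line tools (hypothetical selection) that are not on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

In practice, the version text would come from running `terraform version` or `python --version` and passing the output to `meets_minimum`.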

4.5. MLOps Framework

AWS-Centric MLOps Framework

Data Storage: Amazon S3
Experiment Tracking: MLflow
Feature Engineering & Feature Store: SageMaker Feature Store
Model Training: AWS SageMaker Pipelines
CI/CD Automation: GitLab CI/CD
Infrastructure as Code (IaC): Terraform
Model Deployment: AWS SageMaker Endpoints
Monitoring & Logging: Amazon CloudWatch, Prometheus + Grafana, StackOps Elasticsearch
Security: IAM roles, network security, encryption

4.6. Infrastructure & Environment Setup

Security Considerations

IAM roles & least privilege access.
S3 encryption using AWS KMS.
VPC, private subnets, and security groups.
AWS CloudTrail for auditing.
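
To make the first two items concrete, the sketch below builds two policy documents as plain Python dictionaries: an IAM policy that limits a role to one prefix of one bucket, and a bucket policy that rejects uploads not requesting SSE-KMS. The bucket name, prefix, and function names are hypothetical, and the statements are a starting point to review against your agency's requirements rather than the policies used in the playbook's Terraform code.

```python
import json


def least_privilege_s3_policy(bucket, prefix):
    """IAM policy allowing list/read/write only under one prefix of one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListOnlyThePrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                "Sid": "ReadWriteObjectsUnderPrefix",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


def require_kms_encryption_policy(bucket):
    """Bucket policy denying uploads that do not request SSE-KMS encryption."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnencryptedUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                "Condition": {
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption": "aws:kms"
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    # "example-ml-artifacts" and "training" are placeholder values.
    print(json.dumps(least_privilege_s3_policy("example-ml-artifacts", "training"), indent=2))
```

Generating the documents in code keeps them reviewable and lets the same function serve several environments by changing only the bucket and prefix.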

Infrastructure as Code (Terraform)

main.tf example

Steps:

git clone git@sgts.gitlab-dedicated.com:wog/gvt/dsaidquantitativ/qs-central/mlops/mlops-infra.git
cd mlops-infra
terraform init
terraform plan
terraform apply -auto-approve

Provisioned Resources:
- S3: ML artifact storage
- VPC: Isolated network for SageMaker workloads
- Subnets: Public and private subnets for SageMaker instances
- Security Group: Restricted access for SageMaker and CI/CD services
- IAM Roles: Permissions for SageMaker & CI/CD
- DynamoDB: Stores Terraform state
- Gateway: VPC Internet/NAT Gateway for network access
- Route Table: Routing configurations for VPC subnets
- VPC Endpoint: Private connectivity to AWS services
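
When adapting the network layout, one early decision is how to divide the VPC address range into public and private subnets. The sketch below uses Python's standard `ipaddress` module to carve equal-sized blocks out of a VPC CIDR. The CIDR `10.0.0.0/16`, the /24 subnet size, and the two-plus-two split are assumptions for illustration; the actual values in the mlops-infra repository may differ.

```python
import ipaddress
from itertools import islice


def plan_subnets(vpc_cidr, new_prefix=24, public_count=2, private_count=2):
    """Split a VPC CIDR into equal-sized public and private subnet blocks."""
    vpc = ipaddress.ip_network(vpc_cidr)
    needed = public_count + private_count
    # subnets() yields consecutive blocks of the requested size, lowest address first.
    blocks = [str(block) for block in islice(vpc.subnets(new_prefix=new_prefix), needed)]
    if len(blocks) < needed:
        raise ValueError(f"{vpc_cidr} cannot hold {needed} /{new_prefix} subnets")
    return {"public": blocks[:public_count], "private": blocks[public_count:]}


if __name__ == "__main__":
    print(plan_subnets("10.0.0.0/16"))
```

The resulting CIDR strings can then be supplied to Terraform as input variables, which keeps the address plan in one place.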

4.7. CI/CD Pipeline with GitLab

GitLab CI/CD Setup

.gitlab-ci.yml example

Security Considerations
- Store AWS credentials & secrets in GitLab CI/CD variables.
- Implement security scans such as static code analysis and dependency scans (e.g., SonarQube, Trivy).
- Manage dependencies securely using private repositories, or pull from Shiphats COE and NexusRepo.
- Conduct regular security reviews of the findings.
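
GitLab exposes CI/CD variables to jobs as environment variables, so a pipeline script can read credentials from the environment instead of from files in the repository. The sketch below fails fast when a variable is missing and masks values before logging. It assumes the variables are defined under the standard names read by the AWS CLI and SDKs; the helper names `load_aws_settings` and `mask` are hypothetical.

```python
import os

# Standard environment variable names recognised by the AWS CLI and SDKs.
REQUIRED_VARIABLES = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_DEFAULT_REGION",
)


def load_aws_settings(environ=None):
    """Return the required settings, or raise if any CI/CD variable is unset."""
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARIABLES if not environ.get(name)]
    if missing:
        raise RuntimeError("missing CI/CD variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARIABLES}


def mask(secret, visible=4):
    """Hide all but the last few characters so the value is safer to log."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]
```

Marking the variables as masked and protected in the GitLab project settings adds a second layer of protection on top of this check.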

4.8. Data Engineering & Model Development

Data Engineering Considerations

Data Ingestion & ETL
Data Quality Checks
Storage & Access Control: Bucket policies, IAM policies
Feature Engineering: Transforming & standardizing features for ML
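
As an illustration of the data quality and feature standardization items above, the sketch below checks a batch of records for missing and out-of-range values and then z-score standardizes a numeric column. It uses only the standard library and treats records as a list of dictionaries; the column name `age` and the bounds in the usage lines are made-up examples, and a production pipeline would more likely apply the same checks through a dataframe or data validation library.

```python
from statistics import mean, pstdev


def null_rate(rows, column):
    """Fraction of rows where `column` is missing or None."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


def out_of_range(rows, column, low, high):
    """Rows whose non-null value for `column` falls outside [low, high]."""
    return [
        row for row in rows
        if row.get(column) is not None and not (low <= row[column] <= high)
    ]


def standardize(values):
    """Z-score standardization; a constant column maps to all zeros."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(value - mu) / sigma for value in values]


if __name__ == "__main__":
    rows = [{"age": 30}, {"age": None}, {"age": 150}, {"age": 45}]
    print(null_rate(rows, "age"))             # share of missing values
    print(out_of_range(rows, "age", 0, 120))  # suspicious records
```

A pipeline step could fail the run when `null_rate` exceeds an agreed threshold, so that bad batches never reach training.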

Model Development & Training

Security Considerations
- Restrict training data access using IAM policies.
- Encrypt SageMaker model artifacts.
- Store models in a private S3 bucket with limited access.

pipeline.py example

4.9. Model Deployment

Use IAM policies or API Gateway for endpoint authentication.
Deploy models in a private VPC with AWS WAF.
Implement model versioning & rollback with a model registry.
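
The versioning and rollback idea can be sketched independently of any particular registry product. The in-memory class below is a hypothetical illustration, not the SageMaker Model Registry API: it records registered versions, tracks deployment order, and restores the previous version on rollback. The artifact URIs in the usage note are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class ModelRegistry:
    """Toy registry: maps versions to artifact URIs and tracks deployments."""

    versions: dict = field(default_factory=dict)  # version -> artifact URI
    history: list = field(default_factory=list)   # deployed versions, newest last

    def register(self, version, artifact_uri):
        if version in self.versions:
            raise ValueError(f"version {version!r} is already registered")
        self.versions[version] = artifact_uri

    def deploy(self, version):
        if version not in self.versions:
            raise KeyError(f"version {version!r} is not registered")
        self.history.append(version)

    @property
    def live(self):
        """Currently deployed version, or None if nothing is deployed."""
        return self.history[-1] if self.history else None

    def rollback(self):
        """Drop the current deployment and return the version restored."""
        if len(self.history) < 2:
            raise RuntimeError("no earlier deployment to roll back to")
        self.history.pop()
        return self.live
```

For example, after registering and deploying `1.0.0` and then `1.1.0`, calling `rollback()` makes `1.0.0` the live version again. A managed registry adds approval status, lineage, and persistence on top of this basic behaviour.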

4.10. Monitoring & Observability

Detect drift with SageMaker Model Monitor.
Log model performance in CloudWatch Metrics.
Use Prometheus + Grafana and StackOps Elasticsearch for observability.
Set up alerts in AWS CloudWatch.
Implement anomaly detection for model predictions.
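
One common way to quantify drift between a training baseline and live data is the Population Stability Index (PSI). The sketch below is a plain-Python version for a single numeric feature. It is named here as one possible technique, not as the method SageMaker uses internally, and the bin count, the `epsilon` floor for empty bins, and any alerting threshold are assumptions to tune for your data.

```python
import math


def population_stability_index(expected, actual, bins=10, epsilon=1e-6):
    """PSI between a baseline sample and a current sample of one feature.

    Bins are equal-width over the baseline range; current values outside
    that range are counted in the first or last bin.
    """
    low, high = min(expected), max(expected)
    if high == low:
        raise ValueError("baseline sample has no spread")
    width = (high - low) / bins

    def shares(values):
        counts = [0] * bins
        for value in values:
            index = int((value - low) / width)
            index = min(max(index, 0), bins - 1)  # clamp to a valid bin
            counts[index] += 1
        # Floor empty bins at epsilon so the logarithm stays defined.
        return [max(count / len(values), epsilon) for count in counts]

    baseline, current = shares(expected), shares(actual)
    return sum((c - b) * math.log(c / b) for b, c in zip(baseline, current))
```

A score of zero means the binned distributions match; larger values indicate a larger shift. The score could be published as a custom CloudWatch metric so that an alarm fires when it crosses the chosen threshold.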

4.11. Governance, Compliance, & Security

Implement fine-grained IAM policies for SageMaker and Data Engineering workflows.
Enforce audit logging across CI/CD and SageMaker pipelines.
Use AWS Config and CloudScape for compliance monitoring.
Implement model explainability and fairness checks.
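
As one example of a fairness check, the sketch below computes per-group selection rates and the demographic parity difference for binary predictions. This is only one of several fairness metrics, chosen here for illustration. The function names are hypothetical, positive predictions are assumed to be encoded as 1, and what counts as an acceptable gap is a policy decision for your agency.

```python
def selection_rates(predictions, groups):
    """Share of positive (== 1) predictions within each group."""
    totals, positives = {}, {}
    for prediction, group in zip(predictions, groups):
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + (1 if prediction == 1 else 0)
    return {group: positives[group] / totals[group] for group in totals}


def demographic_parity_difference(predictions, groups):
    """Gap between the highest and lowest group selection rates."""
    rates = selection_rates(predictions, groups)
    return max(rates.values()) - min(rates.values())


if __name__ == "__main__":
    preds = [1, 0, 1, 1, 0, 0]
    groups = ["a", "a", "a", "b", "b", "b"]
    print(selection_rates(preds, groups))
    print(demographic_parity_difference(preds, groups))
```

Recording this value alongside accuracy metrics for each model version gives reviewers a consistent figure to compare during approval.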

4.12. Best Practices & Future Enhancements

Trigger model retraining when data or model performance degrades significantly (e.g., via EventBridge).
Continuously strengthen security based on a zero-trust strategy (e.g., IAM policies, security groups).
Enhance approval workflows with human-in-the-loop integration.
Implement data and MLOps orchestration tools for visibility and control (e.g., Step Functions).
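
The retraining trigger and the human-in-the-loop approval can be combined into a small decision function that an orchestration step calls. The sketch below is a hypothetical illustration: the thresholds in `RetrainPolicy`, the function names, and the returned action strings are assumptions, and in an AWS setup the inputs would typically come from monitoring metrics while the resulting action would be carried out by the orchestration tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrainPolicy:
    """Illustrative thresholds; tune these per model and use case."""

    max_drift: float = 0.25
    min_accuracy: float = 0.90


def retraining_reasons(drift_score, accuracy, policy=RetrainPolicy()):
    """List the reasons, if any, that call for retraining."""
    reasons = []
    if drift_score > policy.max_drift:
        reasons.append("data drift")
    if accuracy < policy.min_accuracy:
        reasons.append("performance degradation")
    return reasons


def next_action(drift_score, accuracy, approved_by=None, policy=RetrainPolicy()):
    """Decide the next step, holding retraining until a human approves it."""
    reasons = retraining_reasons(drift_score, accuracy, policy)
    if not reasons:
        return "no action"
    if approved_by is None:
        return "await approval: " + ", ".join(reasons)
    return f"start retraining (approved by {approved_by})"
```

Keeping the decision logic in a pure function makes it easy to unit test and to audit, since the same inputs always yield the same action.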