Chapter 5A: Setup

In this chapter, we outline how to set up the case study project from SHIPHATS GitLab in your local development environment. By following the steps below, you will learn how to fork the resale price prediction project and (optionally) set up a local development environment on a SEED device. Finally, we summarize the key takeaways.

In addition, we provide an overview of the data used for illustration.

5A.1. High-Level Description of the Project and File Structure

Our case study centers on HDB resale price prediction. The project directory below provides the essential scaffolding for CI/CD and code organization.

The layout below shows how the project's code and resources are organised. You are free to rename or reorganise the directories to suit your project and agency requirements:

|-- img
|   |-- pipeline-full.png
|-- model_code
|   |-- __init__.py
|   |-- evaluate.py
|   |-- lcc-init-script.sh
|   |-- mlflow_logging.py
|   |-- pipeline.py
|   |-- preprocess.py
|-- pipelines
|   |-- __init__.py
|   |-- __version__.py
|   |-- _utils.py
|   |-- get_pipeline_definition.py
|   |-- run_pipeline.py
|-- tests
|   |-- test_pipelines.py
|-- .coveragerc
|-- .gitignore
|-- .gitlab-ci.yml
|-- .pydocstylerc
|-- CONTRIBUTING.md
|-- LICENSE
|-- README.md
|-- config.py
|-- setup.cfg
|-- setup.py
|-- tox.ini

Below is a breakdown of the file structure and a short description for each file:

model_code/

Contains the core logic for data preprocessing, model training, evaluation, and logging:

pipeline.py – Defines the ML pipeline combining preprocessing, model training, evaluation, and logging.
preprocess.py – Handles data transformation and feature engineering.
evaluate.py – Contains evaluation logic for model performance metrics.
mlflow_logging.py – Manages experiment tracking and model artifact logging via MLflow.
lcc-init-script.sh – Shell script used for cluster or cloud environment initialization.

pipelines/

Encapsulates pipeline orchestration and utilities:

run_pipeline.py – Main script to run the full ML pipeline.
get_pipeline_definition.py – Assembles and returns the pipeline structure/configuration.
_utils.py – Helper functions supporting the pipeline logic.
__version__.py – Tracks the version of the pipeline for reproducibility and deployment.

tests/

Contains test cases to validate functionality and enable CI checks:

test_pipelines.py – Unit tests for verifying pipeline integrity and behavior.

CI/CD & Configuration Files

.gitlab-ci.yml – Defines GitLab CI/CD stages like model building, approval, testing, and deployment.
tox.ini – Environment setup and testing tool configuration.
setup.cfg, setup.py – Define package metadata, dependencies, and install behavior.
.coveragerc – Configures code coverage reports for testing.
.pydocstylerc – Linting configuration for Python docstrings.
config.py – Central configuration file for parameters like paths and model settings.

Documentation & Metadata

README.md – Provides an overview, installation instructions, and usage guide.
CONTRIBUTING.md – Contributor guidelines for maintaining code quality and standards.
LICENSE – Specifies the legal terms for use and distribution of the project.

5A.2. Forking the Project

Before you customize the project code or modify the CI/CD pipeline for your use case, we recommend forking the project repository into your own GitLab namespace. This lets you maintain an independent copy of the code while preserving the option to pull updates from the original repository.

1. Locate the Template Repository: Navigate to the SHIPHATS GitLab page hosting the case study project: Project Repository
2. Fork the Repository: Click on the Fork button (typically found near the top-right of the GitLab interface) and choose a namespace (your personal namespace or your team’s group) to house your new project.
3. Verify Fork: Once the process completes, you will have a separate repository URL. You can clone or pull this fork locally.

Note: If your organization uses private repos or different GitLab configurations, the fork option might appear differently. Consult internal documentation if necessary.
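
Once the fork exists, cloning it and registering the original repository as a second remote is what keeps the option to pull updates open. The following is a minimal Python sketch of that sequence, not the project's own tooling. The function names, the remote name "upstream", and the example URLs are all hypothetical placeholders; substitute the URLs shown on your fork and on the template repository page.

```python
import subprocess
from typing import List


def fork_setup_commands(fork_url: str, upstream_url: str, workdir: str) -> List[List[str]]:
    """Build the git commands that clone a fork and track the original repository."""
    return [
        # Clone your fork into a local working directory.
        ["git", "clone", fork_url, workdir],
        # Register the original (template) repository under the remote name "upstream".
        ["git", "-C", workdir, "remote", "add", "upstream", upstream_url],
        # Download the upstream branches so they can be merged later.
        ["git", "-C", workdir, "fetch", "upstream"],
    ]


def run_commands(commands: List[List[str]]) -> None:
    """Run each command in order and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical placeholder URLs, not real SHIPHATS addresses.
    cmds = fork_setup_commands(
        fork_url="https://gitlab.example.com/your-namespace/resale-price-prediction.git",
        upstream_url="https://gitlab.example.com/template-group/resale-price-prediction.git",
        workdir="resale-price-prediction",
    )
    for cmd in cmds:
        print(" ".join(cmd))  # dry run: print the commands instead of executing them
```

The script only prints the commands by default; call run_commands(cmds) once the URLs are correct and your GitLab authentication is in place.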

5A.3. (Optional) Setting Up a Local Development Environment on a SEED Device

Some users may have a SEED device for secure local development with Internet access. The following steps outline how to configure local authentication and GPG commit signing.

Configuring SHIPHATS GitLab Authentication

Refer to the following guide on how to set up your personal access token for your SEED device: Personal Access Token Setup

Setting Up GPG Verification

GPG verification helps ensure the integrity of your commits. Refer to the following guide on setting up GPG verification on your SEED device: GPG Verification Setup

5A.4. Remarks

In this chapter, we went through the steps for forking the case study project on SHIPHATS GitLab for independent development. For teams with SEED devices, we provided optional steps to configure local development environments, including GitLab authentication and GPG verification.

Appendix: Data Description

This section provides details about the HDB resale price dataset that will be used in this case study and outlines how to store and manage this dataset on Amazon S3 in accordance with MLOps best practices.

The dataset is owned by the Housing & Development Board (HDB) and covers resale transactions of HDB apartments from January 2017 onwards, providing valuable insights into historical housing trends and pricing within Singapore’s public housing market.

Data Source and Accessibility
The dataset is publicly available on data.gov.sg. Users can freely download the data under the Open Data Licence, which permits personal and commercial use. If the link provided no longer works, you can also find a snapshot of the dataset (in CSV format) in the project directory.

Update Frequency
HDB updates this dataset on a quarterly basis. Teams integrating data of similar nature into their pipelines should schedule regular data ingestion jobs (e.g., every three months) to keep the pipeline in sync.
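
As one way to decide when a scheduled ingestion job should actually re-download the data, the sketch below checks whether the configured number of calendar months has elapsed since the last ingestion. The helper names and the three-month default are illustrative assumptions based on the quarterly cadence described above, not part of the project code.

```python
from datetime import date


def add_months(d: date, months: int) -> date:
    """Return the first day of the month that falls `months` after d's month."""
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, 1)


def refresh_due(last_ingested: date, today: date, interval_months: int = 3) -> bool:
    """True once `today` reaches the first day of the month that is
    `interval_months` after the month of the last ingestion."""
    return today >= add_months(last_ingested, interval_months)


if __name__ == "__main__":
    last_run = date(2024, 1, 15)  # illustrative date only
    print(refresh_due(last_run, date(2024, 3, 31)))
    print(refresh_due(last_run, date(2024, 4, 1)))
```

A check like this can sit at the start of a daily or weekly scheduled job, so the schedule itself does not need to line up exactly with the publisher's release dates.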

Licence
The dataset is distributed under the Open Data Licence, which allows personal or commercial use at no cost. Organizations can therefore use this dataset for both research and production, subject to the licence terms.

Data Resolution
Each record in the dataset corresponds to a single resale transaction, making it granular enough for detailed analyses and modeling.

Data Features
- month: Month of the resale transaction (in YYYY-MM format).
- town: Town where the property is located.
- flat_type: Type of flat (e.g., 1 ROOM, 2 ROOM, 3 ROOM, 4 ROOM, 5 ROOM, EXECUTIVE, MULTI-GENERATION).
- block: The block number of the apartment.
- street_name: The street name where the apartment is located.
- storey_range: The storey range (in intervals of three) where the apartment is situated.
- floor_area_sqm: The floor area (in square meters) of the apartment.
- flat_model: The model of the flat (e.g., Model A, Improved, New Generation).
- lease_commence_date: The year in which the apartment’s lease commenced.
- remaining_lease: The number of years left before the apartment's lease expires.
- resale_price: The target variable that we are trying to predict, representing the transaction price of the resale apartment.
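
To make the schema above concrete, here is a small sketch that checks the expected columns and converts one CSV row into typed fields using only the Python standard library. It is an illustration, not the project's preprocess.py. The text formats assumed for storey_range ("10 TO 12") and remaining_lease (either a plain number or text such as "72 years 06 months") are assumptions to verify against the actual file, and the sample row is made up purely for demonstration.

```python
import csv
import io
from dataclasses import dataclass

EXPECTED_COLUMNS = [
    "month", "town", "flat_type", "block", "street_name", "storey_range",
    "floor_area_sqm", "flat_model", "lease_commence_date", "remaining_lease",
    "resale_price",
]


@dataclass
class ResaleRecord:
    # block and street_name are left out of this illustrative record.
    year: int
    month: int
    town: str
    flat_type: str
    storey_mid: float  # midpoint of the storey range
    floor_area_sqm: float
    flat_model: str
    lease_commence_year: int
    remaining_lease_years: float
    resale_price: float


def parse_storey_range(text: str) -> float:
    """Assumes the form '10 TO 12' and returns the midpoint of the range."""
    low, high = text.upper().split(" TO ")
    return (int(low) + int(high)) / 2


def parse_remaining_lease(text: str) -> float:
    """Accepts a plain number ('61') or text like '61 years 04 months' (assumed format)."""
    parts = text.lower().split()
    if len(parts) == 1:
        return float(parts[0])
    years = float(parts[0])
    months = float(parts[2]) if len(parts) >= 4 else 0.0
    return years + months / 12


def parse_row(row: dict) -> ResaleRecord:
    """Validate that all expected columns are present, then convert types."""
    missing = [c for c in EXPECTED_COLUMNS if c not in row]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    year, month = (int(p) for p in row["month"].split("-"))
    return ResaleRecord(
        year=year,
        month=month,
        town=row["town"],
        flat_type=row["flat_type"],
        storey_mid=parse_storey_range(row["storey_range"]),
        floor_area_sqm=float(row["floor_area_sqm"]),
        flat_model=row["flat_model"],
        lease_commence_year=int(row["lease_commence_date"]),
        remaining_lease_years=parse_remaining_lease(row["remaining_lease"]),
        resale_price=float(row["resale_price"]),
    )


# Made-up sample row for demonstration only; not taken from the real dataset.
SAMPLE_CSV = (
    "month,town,flat_type,block,street_name,storey_range,floor_area_sqm,"
    "flat_model,lease_commence_date,remaining_lease,resale_price\n"
    "2017-01,EXAMPLE TOWN,4 ROOM,123,EXAMPLE STREET 1,10 TO 12,90,"
    "Model A,1990,72 years 06 months,400000\n"
)

records = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
print(records[0])
```

Failing fast on missing columns is a deliberate choice here: if the published schema changes between quarterly updates, the ingestion step reports it immediately instead of passing malformed rows downstream.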

Notes

- The approximate floor area for each apartment may include additional space such as recess areas, purchased areas, or extra spaces from HDB’s upgrading programs.
- The dataset excludes certain transactions that might not reflect the full market price, which helps maintain data consistency and quality.