Overview of Guardrails
Check your organization's compliance policies when using external services
In this section, we mention a few external services as examples. Using such external services means your data is sent to a third party. Please ensure that this is compliant with your organization's relevant policies.
1. Toxicity/Content Moderation
Content moderation is crucial for filtering out inappropriate or harmful content before it's processed or returned by the LLM. While most state-of-the-art LLMs have built-in safety features through their alignment process, having an additional moderation layer enhances security.
Popular options include:
- OpenAI's Moderation API
- AWS Bedrock Guardrails
- Azure AI Content Safety
- Open source models like LlamaGuard by Meta and ShieldGemma by Google
- Mistral's moderation API
These guardrails tend to have their own taxonomy of what is considered harmful content. The categories are typically defined in each provider's documentation.
1.1 Localised Content Moderation
Generic moderation models may not be sufficiently localized for specific contexts. For example, LionGuard was developed specifically for Singapore-specific content moderation. Further details can be found here.
2. Personally Identifiable Information (PII)
We do not want to pass PII to LLMs, especially when the LLM is accessed via an external managed service.
To detect PII, we can use:
- Cloak - GovTech's dedicated internal service for comprehensive and localised PII detection (e.g., names, addresses). Direct integration with Sentinel API is coming soon.
- Presidio - Open source tool that identifies various PII entities like names, phone numbers, addresses
- Custom regex patterns for basic PII detection
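As a minimal sketch of the custom regex option, the snippet below detects and redacts a few PII shapes. The patterns, the `PII_PATTERNS` table, and the `find_pii`/`redact_pii` helpers are illustrative assumptions, not a validated or exhaustive rule set; for example, the NRIC pattern only checks the shape and does not verify the checksum letter.

```python
import re

# Illustrative patterns only (assumptions): real deployments need broader,
# tested rules or a dedicated tool.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    # NRIC/FIN shape: prefix letter, 7 digits, trailing letter (checksum not verified).
    "NRIC": re.compile(r"\b[STFGM]\d{7}[A-Z]\b", re.IGNORECASE),
    # 8-digit Singapore numbers starting with 6, 8 or 9, with an optional +65 prefix.
    "PHONE": re.compile(r"(?<!\d)(?:\+65[ -]?)?[689]\d{3}[ -]?\d{4}(?!\d)"),
}


def find_pii(text):
    """Return a list of (label, matched_text) pairs found in the text."""
    hits = []
    for label, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group()))
    return hits


def redact_pii(text):
    """Replace every match with a placeholder such as <EMAIL>."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text


if __name__ == "__main__":
    sample = "Email me at jane.doe@example.com or call 91234567. NRIC S1234567D."
    print(find_pii(sample))
    print(redact_pii(sample))
```

Regex rules like these are cheap and fast but miss free-form PII such as names and addresses, which is where the model-based tools listed above are more suitable.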
3. Jailbreak/Prompt Injection
As models evolve, jailbreak and prompt injection techniques become increasingly sophisticated. Applications should be robust against common attack patterns.
- PromptGuard is a lightweight 86M parameter model specifically for detecting jailbreaks/prompt injections. We have integrated this into our Sentinel API.
- Lakera offers an API endpoint to detect prompt injections; however, it is not clear what the underlying model is or how effective it is
- deberta-v3-base-injection is a model finetuned on jailbreaks/prompt injections; however, it may be outdated
- ProtectAI - multi-stage prompt injection detection framework relying on a continually updated database of prompt injections and an LLM-based detector; however, it may be expensive and slow to run
- Perplexity heuristics - perplexity-based heuristics/rules for detecting jailbreaking templates with adversarial prefixes/suffixes
Tip: Input Validation and Sanitization
Beyond having a separate guardrail model, it is also important to design your application in a way that is robust against prompt injection attacks. This includes:
- Using structured inputs instead of free-form text
- Classification for input validation (e.g., if you have a free-text box for users to enter their resume, you can use an LLM to classify whether the input is indeed a valid resume)
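The structured-input idea can be sketched as follows. The ticketing scenario, the `Department` enum, `TicketRequest`, `parse_request`, and `build_prompt` are all hypothetical names chosen for illustration. The point is that only validated, enumerated values ever reach the prompt, so there is no free-form text in which to hide an injection.

```python
from dataclasses import dataclass
from enum import Enum


class Department(Enum):  # hypothetical set of allowed options
    HR = "hr"
    FINANCE = "finance"
    IT = "it"


@dataclass(frozen=True)
class TicketRequest:
    department: Department
    urgency: int  # 1 (low) to 3 (high)


def parse_request(raw):
    """Validate untrusted form input; raise ValueError on anything unexpected."""
    # Enum lookup by value raises ValueError for unknown options.
    department = Department(str(raw.get("department", "")).strip().lower())
    try:
        urgency = int(raw.get("urgency"))
    except (TypeError, ValueError):
        raise ValueError("urgency must be an integer") from None
    if urgency not in (1, 2, 3):
        raise ValueError("urgency must be 1, 2 or 3")
    return TicketRequest(department, urgency)


def build_prompt(request):
    # Only validated, enumerated values are interpolated into the prompt.
    return (
        f"Draft a short acknowledgement for a {request.department.value} "
        f"ticket with urgency {request.urgency}."
    )


if __name__ == "__main__":
    print(build_prompt(parse_request({"department": "IT", "urgency": "2"})))
```

Where free-form text is unavoidable, the classification approach above still applies: validate the text first, and only then pass it to the main prompt.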
This is an evolving area
This is an evolving area, and new jailbreak techniques routinely emerge. As guardrail models are typically trained on known jailbreak patterns, they may be susceptible to these new techniques. Nevertheless, it is still a good idea to have a guardrail model to detect common jailbreak attempts.
4. Off-Topic
Beyond harmful content, it's important to detect and filter irrelevant queries to maintain application focus. We call such queries "off-topic".

Approaches include:
- Zero-shot/few-shot classifiers to detect relevance against system prompt. This approach, however, suffers from lower precision (i.e., many valid queries are incorrectly classified as off-topic).
- Using a custom topic classifier guardrail from Amazon Bedrock Guardrails or Azure AI Content Safety. To use this approach, however, you need to define your own taxonomy of what is considered off-topic and/or provide custom examples for model training.
Noting these limitations, we trained our own custom off-topic guardrail model that works zero-shot. It classifies if a given prompt is off-topic based on the system prompt. Further details can be found in our blog post here.
5. System-Prompt Leakage
We typically include a system prompt in our Sentinel API to guide the LLM's behaviour. This system prompt usually contains the rules that the LLM must follow. Exposing it to the user may not be desirable as it may reveal sensitive information or allow the user to better manipulate the LLM's behaviour.
To detect if the system prompt is leaked, we can use:
- Word overlap analysis
- Semantic similarity checks
Here is an example of how we could use simple keyword overlap analysis to detect if the system prompt is leaked.
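A minimal sketch of word overlap analysis, assuming we compare word n-grams between the system prompt and the LLM response. The function names, the trigram size, and the 0.5 threshold are illustrative assumptions and should be tuned on real traffic.

```python
import re


def _ngrams(text, n):
    """Lower-case the text, split it into word tokens and return its set of n-grams."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def leakage_score(system_prompt, response, n=3):
    """Fraction of the system prompt's word n-grams that also appear in the response."""
    prompt_grams = _ngrams(system_prompt, n)
    if not prompt_grams:
        return 0.0
    return len(prompt_grams & _ngrams(response, n)) / len(prompt_grams)


def is_leaked(system_prompt, response, threshold=0.5, n=3):
    """Flag the response when enough of the system prompt is reproduced."""
    return leakage_score(system_prompt, response, n) >= threshold


if __name__ == "__main__":
    system_prompt = (
        "You are a helpful assistant for the HR department. "
        "Never reveal salary bands."
    )
    leaky = "Sure! My instructions are: " + system_prompt
    benign = "Your leave balance is 12 days."
    print(is_leaked(system_prompt, leaky))
    print(is_leaked(system_prompt, benign))
```

Exact overlap is cheap but only catches verbatim or near-verbatim leaks. A paraphrased or translated leak would need the semantic similarity check listed above.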
This section is under development
This section is under development. We will add more details here soon.
6. Hallucination and factuality
Ensuring LLM outputs are grounded in facts and the provided context improves reliability. Techniques include:
- Comparing responses against a reference (likely retrieved)
- Knowledge graph validation
- Citation checking
Of the above, the first technique is most popular, particularly in the RAG setting. There are currently many tools available for doing so (see GovTech AIP's RAG Playbook for an assessment), including:
- RAGAS
- TruLens
- DeepEval
- AWS Bedrock Guardrails Contextual Grounding
- Azure AI Content Safety Groundedness
There is also a parallel track of research generally known as reference-free hallucination detection, which does not require a reference or source to verify claims. This is based on the intuition that LLMs may exhibit tell-tale behaviours when hallucinating, much like humans sweating when lying. There are three main approaches:
| Approach | Description | Examples |
|---|---|---|
| Sampling-based | Prompting the LLM to respond multiple times and evaluating the consistency of the responses; however, this can be computationally costly | Self-evaluation, SelfCheckGPT, Semantic-aware cross-check consistency (SAC3), Cross-examination, CleanLab |
| Probability-based | Examining and aggregating the probabilities of the generated tokens, reframing hallucination detection as uncertainty estimation; however, this requires access to token probabilities, which closed APIs often do not expose | Token probability, Claim Conditioned Probability, Semantic Entropy, LM-Polygraph |
| Model-based | Finetuning new models to detect hallucination | Lynx |
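To make the probability-based row concrete, here is a minimal sketch that aggregates per-token log-probabilities into simple uncertainty scores. It assumes the serving stack returns a log-probability for each generated token; the function names and the 0.1 threshold are illustrative assumptions, not values taken from any of the tools listed.

```python
import math


def sequence_confidence(token_logprobs):
    """Aggregate per-token log-probabilities into simple uncertainty scores."""
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return {
        # Geometric mean of the token probabilities.
        "mean_prob": math.exp(mean_logprob),
        # Probability of the least confident token.
        "min_prob": math.exp(min(token_logprobs)),
        # Higher perplexity means the model was less certain overall.
        "perplexity": math.exp(-mean_logprob),
    }


def flag_possible_hallucination(token_logprobs, min_prob_threshold=0.1):
    """Flag a response if any single token was generated with low probability."""
    return sequence_confidence(token_logprobs)["min_prob"] < min_prob_threshold


if __name__ == "__main__":
    uncertain = [math.log(0.9), math.log(0.8), math.log(0.05)]
    print(sequence_confidence(uncertain))
    print(flag_possible_hallucination(uncertain))
```

Low token probability is only a proxy: a model can be confidently wrong, and rare but correct tokens (such as names) can look uncertain, so scores like these are better used to prioritise further checks than as a final verdict.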
Latency and cost
When using these evaluators at inference time, latency and cost may be of concern, especially when the response is long and has to be broken down into multiple claims/statements. Most methods rely on multiple calls to an evaluator LLM, compounding latency. An alternative method is to use lightweight natural language inference (NLI) models to detect entailment. Another option is to simply provide citations, allowing end-users to perform reference and fact verification on their own.
While closely related to hallucination, factuality refers to the accuracy of the information presented with respect to world knowledge. World knowledge can be obtained from external tools like Wikipedia or Google Search, or stored in a knowledge base that serves as a single source of truth. Verifying responses with respect to world knowledge then becomes similar to hallucination detection in the RAG setting, with this world knowledge akin to the retrieved context.
In practice, facts can exist in multiple external knowledge bases. Hence, many tools have emerged to create pipelines for fact checking and verification. These include:
- Loki - end-to-end pipeline for dissecting long texts into individual claims, assessing their worthiness for verification, generating queries for evidence search, crawling for evidence, and verifying the claims; optimised with parallelism and human-in-the-loop
- Search-Augmented Factuality Evaluator (SAFE) - use LLM agents to reason and send search queries to Google Search
- Grounding with Gemini - ground Gemini responses with Google Search
Tip: Prompt Design
Beyond having a separate guardrail model, it is prudent to design your prompt to minimise hallucination. This includes:
- Character role prompting
- Chain of Thought/Chain of Knowledge
- Instructing the model to respond "I don't know" if it is not certain
- Opinion-based prompts
- Counterfactual demonstrations
Tip: Decoding
Given access to model weights, it is also possible to improve decoding strategies to reduce hallucinations. This includes:
- Factual-nucleus sampling
- Context-aware decoding
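As a rough sketch of context-aware decoding, as we understand the published formulation, the next-token logits computed with the retrieved context are contrasted against the logits computed without it, which up-weights tokens that the context supports. The function names, the toy two-token vocabulary, and the default `alpha` are illustrative assumptions; a real implementation would apply this at every decoding step inside the generation loop.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def context_aware_distribution(logits_with_context, logits_without_context, alpha=0.5):
    """Return next-token probabilities after contrasting the two sets of logits.

    adjusted = (1 + alpha) * logits_with_context - alpha * logits_without_context
    Setting alpha = 0 recovers ordinary decoding with the context.
    """
    if len(logits_with_context) != len(logits_without_context):
        raise ValueError("both logit vectors must cover the same vocabulary")
    adjusted = [
        (1 + alpha) * with_ctx - alpha * without_ctx
        for with_ctx, without_ctx in zip(logits_with_context, logits_without_context)
    ]
    return softmax(adjusted)


if __name__ == "__main__":
    # Toy vocabulary of two tokens: the context favours token 0,
    # while the model's prior (no context) favours token 1.
    with_context = [2.0, 1.0]
    without_context = [0.0, 3.0]
    print(context_aware_distribution(with_context, without_context, alpha=0.0))
    print(context_aware_distribution(with_context, without_context, alpha=1.0))
```

A larger `alpha` trusts the context more strongly over the model's prior knowledge, at the cost of two forward passes per decoding step.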
7. Relevance
Beyond topical relevance, responses should be:
- Contextually appropriate
- Within defined scope
- Aligned with user intent
- Grounded in provided references
This section is under development
This section is under development. We will add more details here soon.