Overview of Guardrails

Check your organization's compliance policies when using external services

This section mentions a few external services as examples. Using such services means your data is sent to a third party. Please ensure that this complies with your organization's relevant policies.

1. Toxicity/Content Moderation

Content moderation is crucial for filtering out inappropriate or harmful content before it's processed or returned by the LLM. While most state-of-the-art LLMs have built-in safety features through their alignment process, having an additional moderation layer enhances security.

Popular options include:

OpenAI's Moderation API
AWS Bedrock Guardrails
Azure AI Content Safety
Open source models like LlamaGuard by Meta and ShieldGemma by Google
Mistral's moderation API

These guardrails tend to have their own taxonomy of what is considered harmful content. The categories are typically defined in their documentation.
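A moderation layer typically turns per-category scores into an allow/block decision. The sketch below assumes the moderation model or API has already returned a score between 0 and 1 for each category. The category names, the thresholds, and the `moderate` function are hypothetical and do not correspond to any specific service's taxonomy.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical categories and thresholds; each real service defines its own taxonomy.
DEFAULT_THRESHOLDS = {"hate": 0.5, "harassment": 0.5, "self_harm": 0.3, "violence": 0.5}


@dataclass
class ModerationResult:
    flagged: bool
    violations: dict  # category -> score, for every category at or over its threshold


def moderate(scores: dict, thresholds: Optional[dict] = None) -> ModerationResult:
    """Flag text whose score meets or exceeds the threshold in any known category."""
    thresholds = DEFAULT_THRESHOLDS if thresholds is None else thresholds
    violations = {
        category: score
        for category, score in scores.items()
        if category in thresholds and score >= thresholds[category]
    }
    return ModerationResult(flagged=bool(violations), violations=violations)
```

Keeping thresholds per category lets an application be stricter on some categories than others. Categories that are not in the threshold table are ignored here, which is a design choice of this sketch.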

1.1 Localised Content Moderation

Generic moderation models may not be sufficiently localized for specific contexts. For example, LionGuard was developed specifically for Singapore-specific content moderation. Further details can be found here.

2. Personally Identifiable Information (PII)

We do not want to pass PII to LLMs, especially when the LLM is accessed via an external managed service.

To detect PII, we can use:

Cloak - GovTech's dedicated internal service for comprehensive and localised PII detection (e.g., names, addresses). Direct integration with Sentinel API is coming soon.
Presidio - Open source tool that identifies various PII entities like names, phone numbers, addresses
Custom regex patterns for basic PII detection
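As a minimal sketch of the regex option, the patterns below cover email addresses, the general shape of a Singapore NRIC/FIN, and 8-digit Singapore-style phone numbers. The patterns and the `find_pii` and `redact_pii` helpers are illustrative assumptions: they do not verify checksums, will miss many real-world formats, and cannot detect names or addresses.

```python
import re

# Illustrative patterns only; they will miss many real-world formats.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    # NRIC/FIN shape: prefix letter, 7 digits, trailing letter (checksum not verified)
    "nric": re.compile(r"\b[STFGM]\d{7}[A-Z]\b", re.IGNORECASE),
    # 8-digit number starting with 6, 8 or 9, with an optional +65 prefix
    "phone": re.compile(r"(?<!\d)(?:\+65[ -]?)?[689]\d{3}[ -]?\d{4}(?!\d)"),
}


def find_pii(text: str) -> dict:
    """Return {entity_type: [matched strings]} for every pattern that matches."""
    found = {}
    for name, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[name] = matches
    return found


def redact_pii(text: str) -> str:
    """Replace each match with a placeholder such as <EMAIL>."""
    for name, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{name.upper()}>", text)
    return text
```

Redacting before the text is sent to an externally hosted LLM keeps the raw values inside your own environment.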

3. Jailbreak/Prompt Injection

As models evolve, jailbreak and prompt injection techniques become increasingly sophisticated. Applications should be robust against common attack patterns.

PromptGuard is a lightweight 86M parameter model specifically for detecting jailbreaks/prompt injections. We have integrated this into our Sentinel API.
Lakera offers an API endpoint to detect prompt injections; however, it is not clear what the underlying model is or how effective it is
deberta-v3-base-injection is a model finetuned on jailbreaks/prompt injections; however, it may be outdated
ProtectAI - multi-stage prompt injection detection framework relying on a continually updated database of prompt injections and an LLM-based detector; however, it may be expensive and slow to run
Perplexity heuristics - perplexity-based heuristics/rules for detecting jailbreaking templates with adversarial prefixes/suffixes

Tip: Input Validation and Sanitization

Beyond having a separate guardrail model, it is also important to design your application in a way that is robust against prompt injection attacks. This includes:

Using structured inputs instead of free-form text
Classification for input validation (e.g., if users enter their resume in a free-text box, you can use an LLM to classify whether the input is indeed a valid resume)
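To illustrate the structured-input idea, the sketch below restricts one field to a closed set of values and caps the length of the remaining free text before anything reaches a prompt. The helpdesk scenario, the `Topic` values, and the 500-character cap are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Topic(Enum):
    """Closed set of request types the application supports (hypothetical)."""
    LEAVE = "leave"
    CLAIMS = "claims"
    PAYROLL = "payroll"


MAX_DETAILS_CHARS = 500  # arbitrary cap chosen for this sketch


@dataclass(frozen=True)
class HelpdeskRequest:
    topic: Topic
    details: str


def parse_request(raw_topic: str, raw_details: str) -> HelpdeskRequest:
    """Validate user input before any of it is placed into a prompt."""
    try:
        topic = Topic(raw_topic.strip().lower())
    except ValueError:
        raise ValueError(f"Unsupported topic: {raw_topic!r}") from None
    details = raw_details.strip()
    if not details or len(details) > MAX_DETAILS_CHARS:
        raise ValueError(f"Details must be between 1 and {MAX_DETAILS_CHARS} characters")
    return HelpdeskRequest(topic=topic, details=details)
```

The free-text field can still carry an injection attempt, so this reduces the attack surface rather than removing it.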

This is an evolving area

This is an evolving area, and new jailbreak techniques routinely emerge. Because guardrail models are typically trained on known jailbreak patterns, they may miss these newer techniques. Nevertheless, it is still a good idea to have a guardrail model to detect common jailbreak attempts.

4. Off-Topic

Beyond harmful content, it's important to detect and filter irrelevant queries to maintain application focus. We call such queries "off-topic".

Approaches include:

Zero-shot/few-shot classifiers to detect relevance against the system prompt. This approach, however, suffers from lower precision (i.e., many valid queries are incorrectly classified as off-topic).
Using a custom topic classifier guardrail from Amazon Bedrock Guardrails or Azure AI Content Safety. To use this approach, however, you need to define your own taxonomy of what is considered off-topic and/or provide custom examples for model training.

Noting these limitations, we trained our own custom off-topic guardrail model that works zero-shot. It classifies if a given prompt is off-topic based on the system prompt. Further details can be found in our blog post here.
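For comparison, the generic zero-shot classifier approach listed above can be sketched as follows. This is not the custom model described in the previous paragraph. The prompt wording, the `ON_TOPIC`/`OFF_TOPIC` labels, and the `llm` callable (any function that takes a prompt string and returns the model's text) are assumptions made for illustration.

```python
from typing import Callable

CLASSIFIER_TEMPLATE = """You are a relevance classifier.
The application below is defined by its system prompt.

<system_prompt>
{system_prompt}
</system_prompt>

<user_query>
{query}
</user_query>

Answer with exactly one word: ON_TOPIC or OFF_TOPIC."""


def is_off_topic(system_prompt: str, query: str, llm: Callable[[str], str]) -> bool:
    """Ask an LLM whether the query is relevant to the given system prompt."""
    prompt = CLASSIFIER_TEMPLATE.format(system_prompt=system_prompt, query=query)
    verdict = llm(prompt).strip().upper()
    # Fail closed: anything other than a clear ON_TOPIC is treated as off-topic.
    return not verdict.startswith("ON_TOPIC")
```

Failing closed is a design choice of this sketch. It favours blocking over answering when the classifier's output is malformed, which can further increase the number of valid queries that are rejected.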

5. System-Prompt Leakage

We typically include a system prompt in our Sentinel API to guide the LLM's behaviour. This system prompt usually contains the rules that the LLM must follow. Exposing it to the user may not be desirable as it may reveal sensitive information or allow the user to better manipulate the LLM's behaviour.

To detect if the system prompt is leaked, we can use:

Word overlap analysis
Semantic similarity checks

Here is an example of how we could use simple keyword overlap analysis to detect if the system prompt is leaked.
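A minimal sketch of such a check, assuming overlap is measured on lowercased word trigrams. The function names, the n-gram size, and the 0.5 threshold are illustrative choices and would need tuning for a real application.

```python
import re


def _ngrams(text: str, n: int) -> set:
    """Set of lowercased word n-grams in the text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def leakage_score(system_prompt: str, response: str, n: int = 3) -> float:
    """Fraction of the system prompt's word n-grams that reappear in the response."""
    prompt_ngrams = _ngrams(system_prompt, n)
    if not prompt_ngrams:
        return 0.0
    return len(prompt_ngrams & _ngrams(response, n)) / len(prompt_ngrams)


def is_leaked(system_prompt: str, response: str, threshold: float = 0.5) -> bool:
    """Flag the response when enough of the system prompt is reproduced verbatim."""
    return leakage_score(system_prompt, response) >= threshold
```

Word overlap only catches verbatim or near-verbatim leaks. A paraphrased or translated system prompt would score low here, which is where semantic similarity checks come in.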

This section is under development

This section is under development. We will add more details here soon.

6. Hallucination and factuality

Ensuring LLM outputs are grounded in facts and the provided context improves reliability. Techniques include:

Comparing responses against a reference (likely retrieved)
Knowledge graph validation
Citation checking

Of the above, the first technique is the most popular, particularly in the RAG setting. There are currently many tools available for doing so; see GovTech AIP's RAG Playbook for an assessment.

There is also a parallel track of research generally known as reference-free hallucination detection, which does not require a reference or source to verify claims. This is based on the intuition that LLMs may exhibit tell-tale behaviours when hallucinating, much like humans sweating when lying. There are three main approaches:

Approach	Description	Examples
Sampling-based	Prompting the LLM to respond multiple times and evaluating the consistency of the responses; however, this can be computationally costly	Self-evaluation, SelfCheckGPT, Semantic-aware cross-check consistency (SAC³), Cross-examination, CleanLab
Probability-based	Examining and aggregating the probabilities of the generated tokens, reframing hallucination detection as uncertainty estimation; however, this requires access to token probabilities, which closed APIs may not provide	Token probability, Claim Conditioned Probability, Semantic Entropy, LM-Polygraph
Model-based	Finetuning new models to detect hallucination	Lynx
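The simplest probability-based signal can be sketched as follows, assuming the model returns a natural-log probability for each generated token. The function names and the perplexity threshold of 2.0 are hypothetical, and this is a plain token-probability aggregate, not an implementation of Claim Conditioned Probability or Semantic Entropy.

```python
import math


def mean_negative_log_prob(token_logprobs: list) -> float:
    """Average negative log-probability per token (higher means less certain)."""
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    return -sum(token_logprobs) / len(token_logprobs)


def perplexity(token_logprobs: list) -> float:
    """Exponentiated average negative log-probability of the generated tokens."""
    return math.exp(mean_negative_log_prob(token_logprobs))


def is_uncertain(token_logprobs: list, max_perplexity: float = 2.0) -> bool:
    """Flag a response whose perplexity exceeds a threshold chosen for this sketch."""
    return perplexity(token_logprobs) > max_perplexity
```

For example, a response whose tokens each had probability 0.9 has a perplexity of about 1.11 and is not flagged, while one whose tokens each had probability 0.25 has a perplexity of 4 and is flagged. High uncertainty does not prove a hallucination; it only marks a response as worth checking.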

Latency and cost

When using these evaluators at inference time, latency and cost may be of concern, especially when the response is long and has to be broken down into multiple claims/statements. Most methods rely on multiple calls to an evaluator LLM, compounding latency. An alternative method is to use lightweight natural language inference (NLI) models to detect entailment. Another option is to simply provide citations, allowing end-users to perform reference and fact verification on their own.

While closely related to hallucination, factuality refers to the accuracy of information presented in accordance with world knowledge. World knowledge can be obtained from external tools like Wikipedia or Google Search, or stored in a knowledge base that serves as a single source of truth. Verifying responses with respect to world knowledge then becomes similar to hallucination detection in the RAG setting, with this world knowledge akin to the retrieved context.

In practice, facts can exist in multiple external knowledge bases. Hence, many tools have emerged to create pipelines for fact checking and verification. These include:

Loki - end-to-end pipeline for dissecting long texts into individual claims, assessing their worthiness for verification, generating queries for evidence search, crawling for evidence, and verifying the claims; optimised with parallelism and human-in-the-loop
Search-Augmented Factuality Evaluator (SAFE) - use LLM agents to reason and send search queries to Google Search
Grounding with Gemini - ground Gemini responses with Google Search

Tip: Prompt Design

Beyond having a separate guardrail model, it is prudent to design your prompt to minimise hallucination. This includes:

Character role prompting
Chain of Thought/Chain of Knowledge
Instructing the model to respond "I don't know" if it is not certain
Opinion-based prompts
Counterfactual demonstrations

Tip: Decoding

Given access to model weights, it is also possible to improve decoding strategies to reduce hallucinations. This includes:

Factual-nucleus sampling
Context-aware decoding

7. Relevance

Beyond topical relevance, responses should be:

Contextually appropriate
Within defined scope
Aligned with user intent
Grounded in provided references

This section is under development

This section is under development. We will add more details here soon.