Capability Development Areas

A critical step in deploying AI agents is carefully selecting and scoping use cases to match the agent's capabilities and avoid over- or under-engineering. This requires a thorough understanding of the task's complexity, the required level of automation, and the potential benefits of using an agentic approach. The goal is to strike a balance between leveraging the agent's flexibility and minimising unnecessary complexity and associated costs. Capability development can be done in the following areas to allow GovTech product teams and agencies to better leverage AI agents.

Use Case Understanding

First, capability development should focus on meticulously analysing potential use cases in order to separate candidates that can benefit from agentic technology from candidates that cannot. Strategies can be developed for high-potential use cases, as shown below, to transition from a traditional workflow to an agentic one:

Task Decomposition: Start by breaking down potential use cases into their constituent steps. This will help to identify the level of complexity involved and determine whether an agentic approach is truly necessary.
Workflow Mapping: Create visual diagrams illustrating the current workflow and the proposed agent-driven workflow. This makes it easier to compare the complexity of each approach and identify potential bottlenecks or redundancies.
ROI Analysis: Quantify the potential benefits of using an agent, such as reduced processing time, improved accuracy, or increased user metrics like customer satisfaction. Compare these benefits to the costs associated with developing and maintaining the agent, including development time, compute resources, and security measures.
Gradual Rollout: Consider a phased approach, starting with simpler use cases and gradually expanding to more complex ones. This allows you to gain experience and refine your approach before tackling more challenging projects.

Evaluation

After appropriate use cases have been identified, the next step would be to evaluate the systems so that improvements over the status quo can be measured. This would involve the following:

Clear Success Criteria Definition: Establish clear, measurable goals for the agent to achieve. This could include metrics such as accuracy, completion rate, or time to completion.
Test Case Development: Create a diverse set of test cases that cover a range of scenarios and potential edge cases. These test cases should be representative of the types of tasks the agent will be performing in the real world.
Human-in-the-Loop Validation: Incorporate human review into the evaluation process. This allows you to identify errors or inconsistencies that might be missed by automated metrics.
Key Performance Indicators (KPIs) Tracking: Monitor the agent's performance over time and track key metrics to identify trends and areas for improvement.

Latency Management

Even if the evaluation results show that an agentic system is better, the approach might not be adopted as agentic systems typically take a longer time to produce results. This is a key issue for systems which require (near) real-time responses. Capability development efforts can be done in latency management to reduce this friction in adoption. Specific areas which can be studied are as follows:

Tool Selection Optimisation: Carefully select tools that are efficient and responsive. Avoid tools that are known to be slow or unreliable.
Caching: Cache frequently accessed data to reduce the need to repeatedly query external systems.
Operations Parallelisation: Where possible, execute tasks in parallel to reduce overall processing time.
Time Limits: Impose time limits on each step of the agent's workflow to prevent it from getting stuck in endless loops.
Strategic Reflection: Limit the agent's reflection and iteration to when it is truly needed. The reflection is often the part that makes agents really useful, but too much of it may be overkill.

Ablation Studies for Tool Usage

Another area of capability development involves performing experiment and analysis to compare agent performance with different sets of tools, and conduct ablation studies to identify critical tools. This involves systematically testing different combinations of tools and evaluating their impact on the agent's performance.

The steps taken to perform this can be as follows:

Set up a testing environment with benchmark tasks that are representative of real-world government use cases.
Establish a baseline performance with the current toolset, measuring the agent's ability to complete the benchmark tasks accurately and efficiently.
Experiment with different tool combinations, adding or removing tools to see how they affect the agent's performance.
Conduct ablation studies by systematically removing each tool from the agent's inventory and measuring the impact on performance.
Analyse results to identify impactful tools, determining which tools have the biggest impact on performance and which tools can be removed without significantly affecting the agent's capabilities.

Fine-tuning

As elaborated in the Fine-tuning subsection above, LLMs can be fine-tuned to improve their planning and reasoning capabilities, and in their usage of tools. Fine-tuning can be done to improve LLMs’ agentic capabilities with regard to Singapore government specific use cases. For example, datasets could be curated to train LLMs on how to plan and execute complex tasks like procurement processes and adherence to security guidelines (e.g. IM8). Fine-tuning can also be done to improve the tool calling capability of LLMs on government-specific APIs like OneMap APIs, URA Data Service APIs, and Data.gov.sg.

Security

Security is paramount when deploying AI agents, especially in sensitive government applications. Agents can be vulnerable to various attacks that can compromise their integrity, allow malicious actors to gain control, or lead to unintended and potentially harmful actions. Robust security measures are crucial to mitigate these risks and ensure the safe and reliable operation of AI agents. Some areas to look into, for various sub-domains, are listed below:

Defensive Prompt Engineering:
- Input Validation: Implement strict validation rules to filter out potentially malicious input based on predefined patterns or keywords.
- Output Filtering: Develop filters to sanitise the agent's output and remove any injected code or instructions before they are executed.
- Prompt Hardening: Design the agent's prompts to be as unambiguous and resistant to manipulation as possible.
- Quote/Escape Input: Quote or escape raw user input to prevent the LLM from interpreting it as code.
Sandboxing Code Execution Tools:
- Access Restriction: Run code execution tools in sandboxed environments that limit their access to system resources and sensitive data.
- Least Privilege Principle Implementation: Grant code execution tools only the minimum necessary permissions to perform their intended functions.
- Code Review and Static Analysis: Implement code review and static analysis tools to identify potential vulnerabilities in the agent's code before deployment.
- Dynamic Analysis: Incorporate dynamic analysis techniques, like fuzzing, to proactively detect security flaws.
Requiring Human Oversight for Risky Operations:
- Explicit Approval: Require explicit human approval before executing risky operations, such as database updates, financial transactions, or policy changes.
- Role-Based Access Control: Implement granular access control levels to restrict the agent's ability to perform sensitive operations based on the user's role and permissions.
- Thresholds and Limits: Set thresholds and limits on the agent's actions to prevent it from exceeding its authorised scope.
Defining Access Control Levels:
- Granular Permissions: Implement granular access control levels that restrict the agent's ability to perform specific actions based on the user's role and context.
- Dynamic Permissions: Implement dynamic permissions that are granted or revoked based on the agent's current task and the user's authentication status.
- Principle of Least Privilege: Grant the agent only the minimum necessary permissions to perform its intended functions.
- Logging and Auditing: Keep an accurate audit log to enable post-incident investigation.

Optimising AI Infrastructure for Agents

Having the right amount of infrastructure when you need it is important in ensuring performance while saving costs; it is not practical to keep the LLMs, agents and tools readily available 24/7 if it’s only used 5% of the time, and may also not make sense to have to wait twenty minutes every time the workflow is triggered. With the proliferation of tools and smaller models backing these tools, there is also a need to better utilise large GPUs. Some areas that can be explored are:

Intelligent and dynamic orchestration:
- Efficient on-demand provisioning: Ability to dynamically provision infrastructure in a just-in-time (JIT) manner with minimal startup time, ensuring that resources are available when needed and shut down when idle. This reduces waste and ensures responsiveness when workflows are triggered.
- Optimal resource usage: Orchestration layer should allow for GPU slicing to serve multiple models simultaneously while still allowing for autoscaling capabilities. This ensures that GPUs are fully utilised, improving cost efficiency and throughput without having to dedicate an entire GPU to a single model or agent.
Secure and trusted tool management:
- Protecting the supply chain: Similar to the usage of software packages today, there should be a trusted source of tools and/or agents that are verified to be functioning as intended. This includes ensuring that these tools are kept up-to-date, safe from vulnerabilities, and meet compliance standards, reducing the risk of running untrusted or outdated components.
- Logging and Auditing: Implementing comprehensive logging and auditing mechanisms to track the interactions between agents, tools, and infrastructure. This ensures traceability of all actions, helps in identifying potential security issues, and aids in compliance by maintaining logs that can be reviewed for errors, performance issues, or malicious activity.
Standardised agentic frameworks:
- Agentic protocols: Using standardised protocols will allow for interoperability between different agents, tools, and services across platforms. This ensures that agents and tools from different vendors or ecosystems can communicate and work together seamlessly, fostering a more flexible and scalable architecture that can integrate with various external systems.
- Distributed Agent Management: Distributed agent management frameworks should be implemented for agents to be dynamically deployed across various infrastructure types, cloud instances, or on-premise servers. A distributed management system ensures that agents can scale across multiple locations while maintaining a consistent level of performance and minimising latency, and supports the ability to rapidly deploy new agents or update existing ones without causing downtime or service disruption.

TL;DR

Developing capability in areas inhibiting the adoption of agentic AI will allow us to optimise the use this technology. This section serves as the blueprint for GovTech's AI Practice Group to develop our capabilities in the key areas.