AI Security
How Octocom ensures that AI models do not compromise security
What is jailbreaking?
When discussing large language models (LLMs), the most common threat vector that arises is jailbreaking. Simply put, jailbreaking is an attack where a user convinces the model to leak its instructions or alter its intended behavior.
Jailbreaking isn’t the only security risk in AI automation systems, but it remains the primary concern when analyzing AI security threats. Once a model is compromised, it can essentially be manipulated to follow any command, and there’s often little that can be done to mitigate the damage.
Never trust the model
Octocom's approach to mitigating threats from jailbreaking boils down to a simple principle: never trust the model. Given that new LLM exploits emerge almost every week, it would be naive to claim that anyone has fully solved the jailbreaking problem.
However, by assuming that an LLM cannot be trusted and designing systems accordingly, the negative impact of jailbreaking can effectively be reduced to zero.
Scope minimization
The first step in not trusting the model involves minimizing its scope. In practical terms, for e-commerce businesses, this means ensuring that the information provided to the LLM is either entirely public or already known to the customer in question.
If the LLM is pressured into revealing its prompts, it will only disclose information that the customer already knows or that is publicly available. Even if a determined attacker successfully jailbreaks the model, the only information they can extract is non-sensitive.
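To make scope minimization concrete, here is a minimal sketch in Python. The names, data, and structure are hypothetical and are not Octocom's actual code; the point is only that the context handed to the LLM is assembled exclusively from public information and the verified customer's own records, so a leaked prompt reveals nothing an attacker could not already see.

```python
# Hypothetical sketch of scope minimization (illustrative names and data).
# The LLM context is built only from public data plus data the verified
# customer already has access to.

PUBLIC_POLICIES = "Returns accepted within 30 days. Standard shipping: 3-5 days."

# Stand-in customer store; in practice this would be the e-commerce backend.
CUSTOMERS = {
    "cust_123": {"name": "Alice", "open_orders": ["#1001"]},
}


def build_llm_context(verified_customer_id: str | None) -> str:
    """Builds the system-prompt context from non-sensitive sources only."""
    sections = [f"Store policies (public): {PUBLIC_POLICIES}"]
    if verified_customer_id in CUSTOMERS:
        # Scoped lookup: only the verified customer's own records are included.
        profile = CUSTOMERS[verified_customer_id]
        sections.append(f"Verified customer data: {profile}")
    return "\n".join(sections)


# An unverified session gets public information only.
print(build_llm_context(None))
# A verified session additionally gets that customer's own data and nothing else.
print(build_llm_context("cust_123"))
```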
Non-deterministic models, deterministic software
While LLMs are non-deterministic, the data they process comes from deterministic programs designed by expert human developers. Just as we trust banks to display only the financial details relevant to an individual account holder, we ensure that LLMs receive only the scoped customer information they need.
For instance, suppose an attacker successfully jailbreaks an LLM and attempts to retrieve the address of a specific order. The LLM itself does not possess order-specific information unless the user has been verified as the customer who placed that order. Therefore, even if the jailbroken model follows the attacker's instructions, it remains effectively useless to them.
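As an illustration of that deterministic boundary (again with hypothetical names and data, not the real system), an order-lookup tool exposed to the LLM can enforce ownership checks in ordinary code, so a jailbroken request for someone else's order returns nothing useful:

```python
# Hypothetical sketch: ownership is checked by deterministic code, not by the model.

ORDERS = {
    "#1001": {"customer_id": "cust_123", "address": "221B Baker Street"},
    "#2002": {"customer_id": "cust_456", "address": "742 Evergreen Terrace"},
}


def lookup_order(order_id: str, verified_customer_id: str | None) -> dict | str:
    """Tool exposed to the LLM; returns data only for the verified customer's own orders."""
    order = ORDERS.get(order_id)
    if order is None or order["customer_id"] != verified_customer_id:
        # The model is never handed data it could be tricked into revealing.
        return "Order not found for this customer."
    return order


# A jailbroken model requesting someone else's order gets nothing useful.
print(lookup_order("#2002", verified_customer_id="cust_123"))
```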
Beyond information retrieval, AI bots can also execute actions. Actions are strictly scoped to specific workflows, which are governed by deterministic conditions. These conditions ensure that the bot can only perform approved actions, even if the LLM is compromised.
For example, imagine an attacker places an order and attempts to refund themselves 99% of the order value. They exploit the latest jailbreaking method, persuading the LLM to comply. However, if the e-commerce business only allows 20% partial refunds for orders delayed by over a week, then—even with complete control over the LLM—the attacker cannot execute an unauthorized refund. The LLM lacks direct access to refund actions, as these are governed by deterministic software, ensuring the security of the system.
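A minimal sketch of such a deterministic guard, using the 20%-after-one-week policy from the example above (function names and thresholds are illustrative, not Octocom's actual configuration):

```python
# Hypothetical refund guard. The LLM can only *request* a refund; the policy
# check runs in ordinary code, so a jailbroken model cannot exceed what the
# business has configured.

from datetime import date

MAX_PARTIAL_REFUND_PCT = 20
MIN_DELAY_DAYS = 7


def approve_refund(requested_pct: float, promised: date, delivered: date) -> bool:
    """Returns True only if the request satisfies the configured policy."""
    delay_days = (delivered - promised).days
    return requested_pct <= MAX_PARTIAL_REFUND_PCT and delay_days > MIN_DELAY_DAYS


# Attacker persuades the LLM to request a 99% refund on an order delivered a day late:
print(approve_refund(99, promised=date(2024, 5, 1), delivered=date(2024, 5, 2)))   # False
# Legitimate 20% refund on an order delayed by 10 days:
print(approve_refund(20, promised=date(2024, 5, 1), delivered=date(2024, 5, 11)))  # True
```

Because the policy check runs in deterministic software, nothing the model says can change the outcome; only the business's own configuration can.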
Human oversight for unverifiable workflows
In certain cases, it is impossible to fully trust an LLM to autonomously execute workflows due to the risk of jailbreaking. While these cases are rare, one key example is refunds for quality issues. Suppose an attacker convinces an LLM that their item is damaged, and the model agrees to issue a refund. Since LLMs can be jailbroken, we cannot rely on them for such decisions, as damage assessment cannot always be verified deterministically (e.g., analyzing an image to confirm product damage is infeasible with traditional non-AI programs).
In these scenarios, human oversight becomes necessary. The LLM can manage the entire conversation, but the final refund approval is handled by a human, who reviews the images and summary before clicking approve or reject. This hybrid approach eliminates the risk of fraudulent refunds while reducing the workload by 95%, streamlining the quality complaint process without compromising security.
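A simplified sketch of that hybrid flow is shown below, with hypothetical names. The key design choice is that the refund call sits behind an explicit human decision rather than behind the model:

```python
# Hypothetical human-in-the-loop flow: the LLM drafts the case summary, but
# only a human reviewer's decision can trigger the refund.

from dataclasses import dataclass, field


@dataclass
class QualityComplaint:
    order_id: str
    summary: str                       # drafted by the LLM from the conversation
    image_urls: list[str] = field(default_factory=list)
    status: str = "pending_review"


REVIEW_QUEUE: list[QualityComplaint] = []


def submit_for_review(complaint: QualityComplaint) -> None:
    """Called when the conversation ends; no refund is issued at this point."""
    REVIEW_QUEUE.append(complaint)


def issue_refund(order_id: str) -> None:
    """Deterministic backend call, reachable only via the human decision below."""
    print(f"Refund issued for {order_id}")


def human_decision(complaint: QualityComplaint, approved: bool) -> None:
    """Only this step, driven by a human reviewer, can trigger the refund."""
    complaint.status = "approved" if approved else "rejected"
    if approved:
        issue_refund(complaint.order_id)


# The bot handles the conversation and queues the case; a human clicks approve.
case = QualityComplaint("#1001", "Customer reports a cracked mug, photo attached.",
                        ["https://example.com/photo.jpg"])
submit_for_review(case)
human_decision(case, approved=True)
```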
Conclusion
Due to the principles outlined above, companies place trust in our AI systems to autonomously handle a wide range of customer inquiries. The vast majority of support workflows can be fully automated without any security risk; only a small minority requires human oversight. Ultimately, given sufficient system access and fully digitized processes, there is no reason, from a security perspective, not to entrust AI models with handling the majority of customer service.