AI Reliability
How Octocom ensures its AI models are reliable and trustworthy
Hallucinations
Large language model (LLM) hallucinations are a well-known problem. When a standalone LLM (from OpenAI, DeepSeek, Anthropic, etc.) is asked a question it can't answer from its training data or instructions, it's highly likely that the model will start generating made-up information or instructions. For e-commerce brands, this poses a major risk, as the model might produce responses that put the brand in jeopardy.
A common approach to mitigating this is retrieval-augmented generation (RAG). However, this method has its limitations. Since most LLMs don't have a large enough context window to hold an entire knowledge base, RAG retrieves the N most relevant articles and injects only those into the model's context. This works well if the necessary article exists in the knowledge base and the retrieval step successfully finds it. However, if the article is missing, the model will likely resort to hallucination.
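To make the retrieval step concrete, here is a minimal sketch of top-N retrieval, not Octocom's actual implementation: the token-overlap scoring stands in for real embedding similarity, and the article list and function names are illustrative assumptions.

```python
# Minimal sketch of top-N retrieval for RAG (illustrative only).
# A production system would use embedding vectors; token overlap stands in here.

def score(query: str, article: str) -> float:
    """Crude relevance score: fraction of query tokens that appear in the article."""
    q_tokens = set(query.lower().split())
    a_tokens = set(article.lower().split())
    return len(q_tokens & a_tokens) / max(len(q_tokens), 1)

def retrieve_top_n(query: str, articles: list[str], n: int = 3) -> list[str]:
    """Return the N most relevant articles to inject into the model's context."""
    ranked = sorted(articles, key=lambda a: score(query, a), reverse=True)
    return ranked[:n]

knowledge_base = [
    "Shipping takes 3-5 business days within the EU.",
    "Refunds are processed within 14 days of receiving the return.",
    "Our vitamin C gummies cost $19.99 per bottle.",
]

context = retrieve_top_n("How much do your vitamins cost?", knowledge_base, n=2)
prompt = "Answer using ONLY the articles below:\n" + "\n".join(context)
# If no relevant article exists, the model has nothing to ground on
# and is likely to hallucinate -- the failure mode described above.
```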
At Octocom, we take it a step further to ensure our customers don’t have to worry about hallucinations. We’ve developed specialized hallucination-checking models that run alongside the LLMs powering our agents, flagging any information that isn’t sourced from the knowledge base.
These fact-checking models are specially trained to distinguish between general real-world knowledge and business-specific knowledge. For instance, a business-specific question might be, “How much do your vitamins cost?” whereas a general knowledge question could be, “What are the benefits of vitamin C?” It’s a careful balancing act—leveraging LLMs’ vast knowledge while minimizing the chances of fabricated responses.
Most importantly, these fact-checking settings can be fine-tuned and customized for each brand. This allows us to adjust the fact-checking model in cases where it’s too restrictive or too lenient.
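As a rough illustration of how such a check could be wired in (the function names, the per-brand `strictness` setting, and the claim-splitting logic are assumptions for this sketch, not Octocom's internal API), a post-generation checker might classify each claim and flag business-specific claims that the knowledge base doesn't support:

```python
from dataclasses import dataclass

# Illustrative sketch of a post-generation hallucination check.
# classify_claim() and is_grounded() stand in for the specialized
# fact-checking models described above; they are simple placeholders here.

@dataclass
class BrandConfig:
    strictness: float = 0.8  # per-brand knob: higher means more claims get flagged

def classify_claim(claim: str) -> str:
    """Placeholder: decide whether a claim is 'general' world knowledge
    or 'business' knowledge that must come from the brand's knowledge base."""
    business_cues = ("price", "cost", "$", "our ", "we ", "order", "refund")
    return "business" if any(cue in claim.lower() for cue in business_cues) else "general"

def is_grounded(claim: str, knowledge_base: list[str]) -> float:
    """Placeholder: return a 0-1 score for how well the knowledge base supports the claim."""
    claim_tokens = set(claim.lower().split())
    return max(
        len(claim_tokens & set(a.lower().split())) / max(len(claim_tokens), 1)
        for a in knowledge_base
    )

def flag_hallucinations(reply: str, kb: list[str], cfg: BrandConfig) -> list[str]:
    """Return business-specific claims that the knowledge base does not support well enough."""
    flagged = []
    for claim in (s.strip() for s in reply.split(".") if s.strip()):
        if classify_claim(claim) == "business" and is_grounded(claim, kb) < cfg.strictness:
            flagged.append(claim)
    return flagged
```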
Another crucial factor in boosting LLM reliability is simply using the best models available from frontier labs. Whenever a new LLM surpasses the current state-of-the-art, we adopt it within days—only after subjecting it to an extensive set of internal tests. The smarter the model, the less likely it is to hallucinate out of the box.
In summary, Octocom mitigates hallucinations through:
• RAG (retrieval-augmented generation)
• Specialized fact-checking models that flag hallucinations
• Using the best available LLMs
Reliable workflows
While large language models are incredibly impressive at following instructions for simple tasks, their effectiveness quickly deteriorates as complexity increases. The moment you introduce multi-step instructions or scale beyond a dozen scenarios, performance begins to break down. Simply put, the model’s ability to follow instructions decreases with each additional rule or task. This is why standalone LLMs can only handle basic customer service tasks out of the box.
At Octocom, we’ve solved this issue by adopting an agentic architecture. In LLM systems, an agentic approach lets multiple specialized models work together in a collaborative framework, breaking tasks down into smaller pieces and executing them more efficiently.
When configuring bots at Octocom, we decompose each task into separate workflows. Each workflow includes a set of possible actions, text-based instructions tailored to the bot, and multiple variations that adapt to different conditions. Once a customer submits an inquiry, a workflow-picking agent selects the most relevant workflows and passes them downstream to specialized agents for execution.
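A simplified sketch of that routing step is shown below. The workflow schema and the keyword-based `pick_workflows` heuristic are illustrative assumptions; in the real system, a workflow-picking agent (an LLM) makes the selection.

```python
from dataclasses import dataclass, field

# Illustrative sketch of workflow decomposition and routing.
# Keyword matching stands in for the workflow-picking agent described above.

@dataclass
class Workflow:
    name: str
    allowed_actions: list[str]
    instructions: str
    triggers: list[str] = field(default_factory=list)

WORKFLOWS = [
    Workflow("order_status", ["lookup_order"],
             "Fetch the order and summarize its status.",
             triggers=["where is my order", "tracking"]),
    Workflow("refund_request", ["lookup_order", "issue_refund"],
             "Verify eligibility before issuing a refund.",
             triggers=["refund", "money back"]),
]

def pick_workflows(inquiry: str) -> list[Workflow]:
    """Select the workflows whose triggers match the customer inquiry."""
    text = inquiry.lower()
    return [w for w in WORKFLOWS if any(t in text for t in w.triggers)]

selected = pick_workflows("I'd like a refund, where is my money back?")
# Each selected workflow is then handed to a specialized agent for execution.
```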
For critical tasks, such as hallucination detection, we take reliability a step further by deploying multiple copies of the same agent to make decisions independently. These agents then vote on the best response; this democratic approach significantly improves accuracy by reducing the impact of one-off, individual model errors.
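In code, the voting pattern itself is simple. The sketch below is only an illustration of the idea: the toy, randomized checker stands in for a real LLM agent, and the function names are assumptions.

```python
import random
from collections import Counter

# Illustrative majority-vote wrapper: run several independent copies of the
# same check and keep the verdict most of them agree on.

def majority_vote(agent_fn, task, n_copies: int = 5):
    """Call the same agent n_copies times and return the most common verdict."""
    verdicts = [agent_fn(task) for _ in range(n_copies)]
    return Counter(verdicts).most_common(1)[0][0]

def toy_hallucination_check(reply: str) -> str:
    """Toy stand-in for an LLM-based checker that occasionally errs."""
    return "hallucination" if random.random() < 0.2 else "ok"

verdict = majority_vote(toy_hallucination_check, "Your order ships tomorrow.")
# A single wrong verdict is outvoted by the other copies.
```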
More importantly, because each inquiry is broken down into a series of small, discrete steps, we gain fine-grained visibility into the system’s reasoning at every stage. This means we can pinpoint specific agent decisions, fine-tune problematic steps, and continuously update the system to reduce future mistakes. With this approach, Octocom’s bots learn and improve with every interaction, ensuring smarter, more reliable AI-driven customer service over time.
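To illustrate what that step-level visibility can look like (the trace schema and the example entries are hypothetical, not Octocom's logging format), each agent decision can be recorded as a small, inspectable record:

```python
import json
from dataclasses import asdict, dataclass

# Illustrative per-step trace: because each inquiry is handled as a series of
# discrete agent decisions, every step can be logged and reviewed individually.

@dataclass
class StepTrace:
    agent: str
    input_summary: str
    decision: str

trace = [
    StepTrace("workflow_picker", "customer asks about a late order", "order_status"),
    StepTrace("order_agent", "order lookup requested", "call lookup_order"),
    StepTrace("fact_checker", "draft reply reviewed", "ok"),
]

print(json.dumps([asdict(step) for step in trace], indent=2))
# A problematic decision can be pinpointed to a single step and fine-tuned.
```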
Reliable action execution
Another common challenge with LLMs is the unreliable execution of actions. Actions are a critical component of any customer service system—ultimately determining how useful an agent is. They’re responsible for both retrieving information (e.g., products, orders) and performing modifications (e.g., order changes, refunds, newsletter management). However, much like hallucinations, a standalone LLM frequently makes errors, such as calling the wrong action, forgetting to call an action, or executing steps out of order.
At Octocom, we’ve developed a system called Think Before You Speak (TBYS) to significantly improve the reliability of action execution. The name says it all—before generating a response, a specialized agent plans and reasons through the problem. This system relies on specialized reasoning models—think OpenAI’s o1 or DeepSeek’s R1, but fine-tuned specifically for executing correct actions.
One of the biggest advantages of TBYS—beyond enabling the model to think and plan—is that it allows the model to verify action outcomes before responding. While this might seem obvious, many AI systems suffer from a fundamental issue: they generate a response first and only afterward realize that an action has failed. TBYS eliminates this problem by ensuring the model confirms action success before composing its reply. This process closely mirrors how humans operate—we first check the data, then take action, and only after that do we communicate the results.
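The ordering is the key idea, and a rough sketch of it follows; the function names and structure are assumptions for illustration, not the actual TBYS implementation.

```python
# Rough sketch of the "think before you speak" ordering: plan, act,
# verify the action outcomes, and only then compose the reply.
# plan_actions(), execute(), and compose_reply() are placeholders.

def plan_actions(inquiry: str) -> list[str]:
    """Placeholder for the reasoning step that decides which actions to take."""
    return ["lookup_order"] if "order" in inquiry.lower() else []

def execute(action: str) -> dict:
    """Placeholder action executor; returns an outcome the model can inspect."""
    return {"action": action, "ok": True, "data": {"status": "shipped"}}

def compose_reply(inquiry: str, outcomes: list[dict]) -> str:
    """The reply is written only after every action outcome has been checked."""
    if any(not outcome["ok"] for outcome in outcomes):
        return "Sorry, I couldn't complete that just now. Let me escalate this for you."
    return f"Here's what I found: {outcomes}"

def handle(inquiry: str) -> str:
    outcomes = [execute(action) for action in plan_actions(inquiry)]  # think and act first
    return compose_reply(inquiry, outcomes)                           # speak last
```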
In addition to TBYS, we enhance reliability through a workflow-based approach. Each workflow defines a set of allowed actions and can impose conditions, such as “this action must occur before that action”. This allows traditional deterministic software to further guide and constrain the LLM, reducing the likelihood of errors.
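A minimal sketch of such a deterministic guard is shown below, assuming a simple whitelist-plus-precedence schema; the class and field names are illustrative, not Octocom's actual workflow engine.

```python
# Illustrative deterministic guard around the LLM's action calls:
# each workflow whitelists actions and can require an ordering,
# e.g. lookup_order must run before issue_refund.

class WorkflowGuard:
    def __init__(self, allowed: set[str], must_precede: dict[str, str]):
        self.allowed = allowed
        self.must_precede = must_precede  # action -> action that must come first
        self.executed: list[str] = []

    def check(self, action: str) -> None:
        """Raise if the proposed action is disallowed or out of order."""
        if action not in self.allowed:
            raise ValueError(f"Action '{action}' is not allowed in this workflow")
        prerequisite = self.must_precede.get(action)
        if prerequisite and prerequisite not in self.executed:
            raise ValueError(f"'{prerequisite}' must run before '{action}'")
        self.executed.append(action)

guard = WorkflowGuard(
    allowed={"lookup_order", "issue_refund"},
    must_precede={"issue_refund": "lookup_order"},
)
guard.check("lookup_order")  # allowed
guard.check("issue_refund")  # allowed only because the lookup already ran
```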
Moreover, limiting the LLM’s choices by providing a smaller, more relevant set of actions and specialized instructions significantly reduces mistakes. Again, this mirrors human behavior: when people are overloaded with too many options and instructions, they make mistakes more frequently.
Disclaimer: Despite significant advances in minimizing hallucinations and mistakes, they can still occur. However, their magnitude and impact are significantly reduced, with only minor errors occasionally slipping through. Additionally, we continuously fine-tune our models for each brand, ensuring ongoing improvement over time.