Part 2 of our series on the rise of AI agents

Cast your mind back to early 2024, when every tech company started announcing their “AI agents.” 

The demos were impressive – agents that could book flights, write code, and manage complex workflows with minimal human input. Fast-forward to today, and agentic AI has progressed – though whether it matches, exceeds, or lags behind those early expectations is a matter of debate. 

Generally speaking, while reasoning models like o3 have shown remarkable problem-solving abilities, building agents that work in production is a completely different challenge. 

It’s not about having the smartest model – it’s about understanding what you’re building, why most attempts fail, and how to avoid the traps that have derailed countless projects.

So, what does it take to build agents that work? And why do most teams get it so wrong from the start?

A Reality Check: What Research Reveals About Agentic AI

While the hype around agentic AI continues to build, cutting-edge research paints a more sobering picture of how these systems perform in real-life production environments. 

Here’s a quick rundown of some of the biggest challenges to achieving agentic AI, to set the scene for how to break through them and deliver reliable, high-performing systems:

The Performance Gap

Carnegie Mellon’s latest benchmark study tested leading agentic systems on common workplace tasks and found success rates hovering around 30-35%. Even Google’s Gemini 2.5 Pro could only complete 30.3% of multi-step tasks autonomously. For context, that means roughly 7 out of 10 of these benchmark tasks went uncompleted – a sobering baseline for anyone expecting agents to run unattended.

A recent analysis by UC Berkeley Sutardja Center reveals why these systems can struggle: “The highly complex nature of multi-step, multi-agent reasoning expands the attack surface of agentic AI. We can expect compromised agentic execution either due to hallucination or adversarial attack, or even their own scheming.” 

The Trust Crisis

Trust remains the fundamental barrier. First Page Sage’s comprehensive user study found that manual search results were trusted 20 percentage points more than agentic results – a gap that widened to 37 points among technical users who understand AI limitations. 

This trust deficit explains why Gartner predicts over 40% of agentic AI projects will be cancelled by 2027 due to “escalating costs, unclear business value, or inadequate risk controls.”

The research consistently points to three critical modes of failure:

  • Hallucination cascade – where errors compound across multi-step reasoning chains
  • Adversarial vulnerability – expanded attack surfaces in complex, multi-agent workflows
  • Trustworthiness degradation – user confidence erosion from unpredictable outputs

Understanding these patterns is essential for teams building systems that need to work reliably, not just impressively in demos.

Why Building AI Agents is So Challenging

The leap from transformer-based chatbots to agentic AI is enormous. You’re not just building a smarter chatbot; you’re creating systems that need to reason through uncertainty, coordinate multiple tools, and make decisions with real-world consequences.

Some of the primary adoption challenges for agents in 2025 are over-ambition, engineering complexity, and classic data issues.

Build and Test Environments Don’t Match Reality

Teams are tempted to build agents that can handle everything from customer service to strategic planning, only to discover that their systems produce inconsistent results or fail completely once they encounter real users. 

In many ways, that’s a classic modelling issue – training and evaluating on idealised data produces favourable test results that completely fail to translate into real-world performance. It parallels the familiar machine learning problem where a model achieves 95% accuracy on its test set but performs poorly in production. 

With agents, the issue is amplified because training data consists of idealised workflows where users ask clear questions, provide complete context, and follow logical sequences. 

Real users interrupt mid-conversation, change their minds, provide contradictory information, and request items that weren’t included in the training data. The agent that smoothly handles “book me a flight to New York” in testing breaks down when a user says “Actually, can you make that Boston instead, but only if it’s cheaper, and I need to be back by Thursday for my daughter’s recital.”

Hallucination Risk

Further, LLMs are prone to hallucinations and inconsistencies, and chaining multiple AI-driven steps compounds these issues: a small error rate at each step multiplies across every link in the chain. 

Multi-agent setups and heavy tool calling also add latency and cost at every step, which quickly becomes prohibitive once agents are chained together into complex workflows.

Technical Considerations

Even in 2025, most GenAI tools are single-turn or stateless. You send a prompt and get a response. With larger context windows and better retrieval, these tools are easier than ever to build: you don’t manage long-term state, you don’t maintain goals over time, and costs are predictable per request.

Agents are fundamentally different. They’re expected to hold multi-turn conversations, remember prior context, use tools, and pursue user goals across many steps. Even with better models and frameworks, this introduces stubborn engineering challenges:

  • Context window limits – Despite 100k-token models, you can’t just dump in all history. Longer contexts increase latency and cost, and you still need chunking or retrieval strategies that can fail or lose important details.
  • Attention scaling – Even with linear-attention architectures, cost grows with input length, limiting real-time use for long sessions.
  • Memory consistency – Storing knowledge as embeddings introduces drift when models update, breaking retrieval quality and leading to contradictions.
  • State management – Agents need to serialise evolving goals, tool calls, and user corrections in text form that fits within context limits, a fragile process even with modern tooling.

These systems also fail opaquely. A bad answer is rarely traceable to any clear bug or rule; instead, it emerges from billions of weights and stochastic sampling. Debugging remains a data-heavy, probabilistic process, not straightforward software engineering.
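To make the state-management problem concrete, here is a minimal sketch in Python (the class and field names are hypothetical, not any particular framework’s API) of how an agent might serialise its evolving goal, tool results, and conversation history into a bounded prompt context:

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class AgentState:
    """Hypothetical container for everything the agent must carry between turns."""
    goal: str
    steps_done: list[str] = field(default_factory=list)
    tool_results: list[dict] = field(default_factory=list)
    history: list[dict] = field(default_factory=list)  # [{"role": ..., "content": ...}]

def serialise_for_prompt(state: AgentState, max_chars: int = 8000) -> str:
    """Serialise state into text for the next model call, trimming oldest history first.

    Character count stands in for a real token count; the trimming is exactly the kind
    of lossy, fragile step described above -- a dropped turn may hold the detail the
    agent later needs.
    """
    payload = asdict(state)
    while len(json.dumps(payload)) > max_chars and payload["history"]:
        payload["history"].pop(0)  # drop the oldest turn first
    return json.dumps(payload, indent=2)

state = AgentState(goal="Book a flight to Boston, returning by Thursday")
state.history.append({"role": "user", "content": "Only book it if it's cheaper than New York."})
prompt_context = serialise_for_prompt(state)
```

Everything the agent “remembers” has to survive this squeeze on every single turn, which is one reason long sessions degrade in subtle, hard-to-debug ways.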

Cost

Stateless GenAI calls remain cheap and predictable in 2025. But agents need multiple coordinated model calls for planning, memory updates, tool use, and conversation turns. 

Even with more efficient inference, costs rise quickly, especially for high-concurrency applications.
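As a back-of-the-envelope illustration of why costs climb (every price and count below is a hypothetical placeholder, not real model pricing), per-session cost scales with the number of coordinated calls each turn triggers:

```python
def estimate_session_cost(
    turns: int,
    calls_per_turn: int,          # planning + tool use + memory update, etc.
    avg_input_tokens: int,
    avg_output_tokens: int,
    price_per_1k_input: float,    # hypothetical $/1k tokens
    price_per_1k_output: float,
) -> float:
    """Rough per-session cost: every agent turn fans out into several model calls."""
    per_call = (avg_input_tokens / 1000) * price_per_1k_input \
             + (avg_output_tokens / 1000) * price_per_1k_output
    return turns * calls_per_turn * per_call

# Illustrative only: 20 turns, 4 calls per turn, placeholder prices.
print(round(estimate_session_cost(20, 4, 3000, 500, 0.005, 0.015), 2))
```

Multiply a figure like that by thousands of concurrent users and the gap between a stateless chatbot and a stateful agent becomes obvious.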

Training also remains expensive. Fine-tuning for reliable tool use, long-term memory, and domain adaptation requires large data pipelines, human feedback, and massive GPU resources. 

Despite more efficient methods, these costs can still exceed development budgets, and there’s no guarantee your new model generalises well.


Above: AI compute requirements have escalated rapidly. Even as compute costs drop, increasingly complex AI models remain more costly to train. Source: Wikimedia Commons

Ultimately, building real agents means going beyond prompt design. It demands careful system design for memory, state, retrieval, and cost control – challenges that remain deeply open even with today’s best models.

Architecture Choices That Make or Break Your Agent

Most AI agents fail not because of bad models or insufficient data, but because teams choose the wrong architectural pattern from the start. 

The choice between reactive, reasoning, and multi-agent architectures determines whether your system will handle edge cases gracefully or break down when users need it most. 

Reactive Architectures

These handle single interactions without memory between conversations. Each request gets processed independently – perfect for scenarios where context doesn’t matter and speed is crucial.

Main characteristics:

  • Stateless processing – Each request starts fresh with no memory of previous interactions
  • Fast response times – No context tracking means lower latency and simpler infrastructure
  • Predictable scaling – Easy to distribute load across multiple instances
  • Consistent behaviour – Same input produces similar outputs every time

Reactive architectures work brilliantly for customer FAQ systems, content generation tools, and simple classification tasks. They excel at high-volume, single-turn interactions where consistency matters more than continuity. 

However, they fall short for any workflow requiring memory, multi-step processes, or learning from previous interactions.
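A minimal sketch of the reactive pattern, assuming a hypothetical call_llm wrapper around whichever model API you use – each request is handled in complete isolation:

```python
def call_llm(prompt: str) -> str:
    # Placeholder: swap in a real call to your model provider here.
    return f"[model response to: {prompt[:40]}...]"

def handle_request(user_message: str) -> str:
    """Reactive agent: no memory, no state -- each request is processed independently."""
    prompt = (
        "You are a customer FAQ assistant. Answer the question below concisely.\n\n"
        f"Question: {user_message}"
    )
    return call_llm(prompt)

# Every call starts fresh, so scaling is simply a matter of running more copies
# of this function behind a load balancer.
print(handle_request("What is your refund policy?"))
```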

Goal-Directed Architectures

Goal-directed systems maintain context across interactions and can break complex tasks into manageable steps. They remember what you’re working on and can plan how to achieve objectives over multiple conversations – key to the core thesis of agentic AI. 

Main characteristics:

  • Persistent memory – Maintains context across conversations and even across sessions
  • Planning mechanisms – Decomposes large tasks into subtasks and tracks progress
  • State management – Tracks progress toward objectives and adapts when circumstances change
  • Tool orchestration – Coordinates multiple tools, learning which combinations work best

JPMorgan Chase’s COiN system demonstrates goal-directed architecture, tackling document analysis at enterprise scale. The bank processes around 12,000 commercial loan agreements annually – complex legal documents that lawyers traditionally spent 360,000 collective hours reviewing each year. 

Their goal-directed agent maintains context across lengthy contracts, plans how to extract approximately 150 different attributes from each agreement, and coordinates multiple AI tools for text analysis, pattern recognition, and clause identification.
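Stripped to its essentials (and emphatically not JPMorgan’s actual implementation – the plan and execute functions below are hypothetical placeholders), the goal-directed loop looks like this:

```python
def plan(goal: str) -> list[str]:
    # Placeholder planner: a real system would ask a model to decompose the goal.
    return [f"understand: {goal}", f"gather inputs for: {goal}", f"produce output for: {goal}"]

def execute(subtask: str, memory: list[str]) -> str:
    # Placeholder executor: a real system would call tools and models here.
    return f"done ({subtask})"

def run_goal_directed_agent(goal: str) -> list[str]:
    memory: list[str] = []                        # persistent state across steps
    for subtask in plan(goal):
        result = execute(subtask, memory)
        memory.append(f"{subtask} -> {result}")   # track progress toward the objective
    return memory

print(run_goal_directed_agent("extract key clauses from a loan agreement"))
```

The defining features are the persistent memory and the explicit decomposition into subtasks – everything else is plumbing around those two ideas.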

Multi-Agent Architectures

Multi-agent systems involve specialised agents collaborating on complex problems that exceed the ability of any single agent. Companies like Sakana AI even draw inspiration from ‘swarm intelligence’ in nature – schools of fish, ant colonies, and other collectives that work together to perform complex tasks. 

Main characteristics:

  • Role specialisation – Different agents are optimised for different domains or functions
  • Collaborative processing – Agents share information and coordinate actions in real-time
  • Fault tolerance – Failures in one agent don’t crash the entire system
  • Scalable complexity – Can handle enterprise-level, multi-domain problems

Sakana AI’s Multi-LLM AB-MCTS system demonstrates AI collaboration. The system coordinates different AI models – ChatGPT, Gemini, and DeepSeek – with each specialising in different tasks. One might excel at generating initial solutions, another at error detection, and a third at refinement. 

On the challenging ARC-AGI benchmark, the system dynamically assigns tasks based on each model’s strengths. 


Above: Sakana’s multi-model AI system. Source: Sakana AI.

In one case, when o4-mini generated an incorrect solution, the system handed it to DeepSeek-R1 and Gemini-2.5 Pro, which analysed the error, corrected it, and produced the correct answer. 
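The division-of-labour idea can be sketched as a simple generate-review-refine pipeline across specialised roles (the ask helper and model names are placeholders, and this is not Sakana’s AB-MCTS search algorithm):

```python
def ask(model: str, prompt: str) -> str:
    # Placeholder: route the prompt to the named model via your provider(s).
    return f"[{model}] response to: {prompt[:40]}..."

def solve_with_roles(problem: str) -> str:
    """Each role is handled by whichever model is strongest at it."""
    draft = ask("generator-model", f"Propose a solution to:\n{problem}")
    critique = ask("critic-model", f"Find errors in this solution:\n{draft}")
    final = ask("refiner-model",
                f"Problem:\n{problem}\nDraft:\n{draft}\nCritique:\n{critique}\n"
                "Produce a corrected solution.")
    return final

print(solve_with_roles("Solve this ARC-style grid transformation puzzle..."))
```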

Data Engineering For Agents

Traditional machine learning feels straightforward – show a model thousands of cat photos labelled “cat” and it learns to recognise cats. Agent training works completely differently, and most teams don’t realise this until they’re months into failed projects.

The difference comes down to the intent of the teaching and, therefore, its mechanism:

  • Traditional ML teaches pattern recognition – when you see these pixel patterns, output this label. 
  • Agent training teaches cognitive processes – when facing an uncertain situation, here’s how experts gather information, evaluate options, and make decisions under pressure.

Consider a medical diagnosis system. Traditional ML training may show thousands of examples, such as “chest pain + elevated enzymes = heart attack.” The model learns to associate symptoms with corresponding outcomes. 

Agent training needs something entirely different. It needs complete examples of how doctors think through difficult cases, from initial contact to diagnostics to treatment. 

This is not just a data quantity issue – it’s a different category of data that captures human reasoning processes rather than just human conclusions.

Training Data For Agentic AI

Building effective agents requires fundamentally different training data from traditional ML models. You’re not teaching pattern recognition – you’re teaching cognitive processes and professional expertise.

Medical diagnosis agents need complete case studies showing expert reasoning under uncertainty:

  • How radiologists distinguish between benign and malignant tumours when imaging is inconclusive
  • When emergency physicians order expensive tests versus waiting for additional symptoms
  • How specialists handle cases where multiple conditions present simultaneously
  • Why experienced doctors sometimes ignore protocol when patient presentation is atypical

Legal contract analysis requires examples of complete legal reasoning workflows:

  • How attorneys identify hidden liability clauses buried in standard language
  • When lawyers recommend renegotiation versus accepting calculated risks
  • How legal teams build fallback positions before entering high-stakes negotiations
  • Why contract specialists flag seemingly innocent terms that create future problems

Financial fraud detection demands investigation methodologies from expert analysts:

  • How investigators recognise sophisticated schemes that mimic legitimate business patterns
  • When analysts escalate suspicious activity versus continuing surveillance
  • How teams coordinate across jurisdictions when tracking international money flows
  • Why experienced investigators trust instinct when data appears contradictory

The key insight is that agents need to learn professional judgment, not just domain knowledge. They must understand when to break rules, when to trust incomplete information, and how experts actually think under pressure.
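In practice, that means each training example is a reasoning trace rather than an input-label pair. A minimal sketch of what such a record might contain (the field names are illustrative, not a standard schema):

```python
from dataclasses import dataclass

@dataclass
class ReasoningTrace:
    """One annotated example of expert decision-making (illustrative schema)."""
    case_summary: str                 # the situation the expert faced
    observations: list[str]           # what they noticed, in order
    options_considered: list[str]     # the alternatives they weighed
    decision: str                     # what they chose to do
    rationale: str                    # why, including any protocol they deliberately bent
    outcome: str = ""                 # what happened, for later evaluation

example = ReasoningTrace(
    case_summary="Chest pain, borderline enzymes, atypical presentation",
    observations=["ECG non-diagnostic", "pain worsens on exertion", "history of anxiety"],
    options_considered=["discharge with follow-up", "order stress test", "admit for observation"],
    decision="admit for observation",
    rationale="Exertional pattern outweighs the borderline labs; protocol would allow "
              "discharge, but the presentation is atypical",
)
```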

The Agent Data Annotation Challenge That’s Breaking Budgets

Data labelling for agents means annotating how experts think through problems, make decisions under uncertainty, and adapt when plans go wrong. You’re not solely labelling what something is – you’re labelling cognitive processes and decision-making workflows.

This requires domain experts who understand both the subject matter and how professionals work in practice. You need:

  • Medical professionals who can articulate diagnostic reasoning
  • Lawyers who understand legal strategy development
  • Financial experts who can explain risk assessment methodologies

At Aya Data, we solve this by partnering directly with domain experts and established institutions to capture complete decision-making workflows. 

For medical AI agents, we work with radiologists and pathologists to annotate diagnostic reasoning processes – how doctors think through ambiguous cases, handle conflicting test results, and adapt treatment plans. 

Our teams collaborate with healthcare professionals from institutions like the University of Ghana Medical Centre to ensure clinical accuracy.

RLHF for Teaching Agent Reasoning

Reinforcement learning from human feedback (RLHF) has become essential for training agents because it teaches decision-making processes rather than just outcomes. RLHF works by training a reward model on human preferences, then using reinforcement learning to optimise agent behaviour according to that reward signal.

However, human feedback data is expensive to gather, and the need for firsthand human input creates a costly bottleneck that limits how far the RLHF process can scale. 

You ideally need to recruit genuine domain experts – doctors for medical agents, experienced lawyers for legal systems, professional traders for financial agents, and so on. 
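At the heart of the reward-model step is a simple pairwise preference loss: the model learns to score the expert-preferred response above the rejected one. A minimal PyTorch sketch, with random vectors standing in for real response embeddings:

```python
import torch
import torch.nn as nn

# Toy reward model: maps a response representation to a scalar score.
reward_model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
optimiser = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Stand-ins for embeddings of (chosen, rejected) response pairs from expert annotators.
chosen = torch.randn(8, 16)
rejected = torch.randn(8, 16)

for _ in range(100):
    r_chosen = reward_model(chosen)
    r_rejected = reward_model(rejected)
    # Bradley-Terry style loss: push the preferred response's score above the rejected one's.
    loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
# The trained reward model then provides the signal for the RL optimisation stage.
```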

Testing Non-Deterministic Systems

Traditional unit tests assume that, given the same input, your system will produce the same output. This assumption contradicts the behaviour of agents, which may employ different valid approaches to solve the same problem depending on the context, previous interactions, or even random factors in their reasoning process.

Agent evaluation requires different methodologies because you’re testing outcomes rather than processes, measuring performance across diverse scenarios rather than checking specific outputs. 

Some key evaluation metrics include:

  • Intent resolution – Does the agent correctly understand what users want and provide solutions that address their needs?
  • Task completion accuracy – Does the agent actually accomplish what it was asked to do?
  • Tool call precision – Does the agent use tools correctly and handle failures gracefully?
  • Conversational efficiency – How many turns does it take to complete tasks without unnecessary back-and-forth?

The key point is that agents are evaluated at two distinct levels: end-to-end evaluation, which treats the entire system as a black box, and component-level testing, which examines tool usage, reasoning chains, and decision-making.
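Here is a minimal sketch of outcome-level evaluation: run the agent several times per scenario and score whether the outcome satisfies the task, rather than asserting on an exact output string (run_agent and the outcome checks are placeholders for your own system):

```python
from statistics import mean
from typing import Callable

def evaluate_agent(
    run_agent: Callable[[str], str],                      # your agent entry point (placeholder)
    scenarios: list[tuple[str, Callable[[str], bool]]],   # (task, outcome check)
    trials: int = 5,
) -> dict[str, float]:
    """Score task completion rate per scenario across repeated, non-deterministic runs."""
    results = {}
    for task, check_outcome in scenarios:
        successes = [check_outcome(run_agent(task)) for _ in range(trials)]
        results[task] = mean(successes)
    return results

def run_agent(task: str) -> str:
    # Placeholder: call your actual agent here.
    return "Flight to Boston confirmed, returning Thursday."

# The outcome check cares about the booking being made, not the exact wording.
scenarios = [
    ("Book the cheapest return flight to Boston by Thursday",
     lambda output: "boston" in output.lower() and "confirmed" in output.lower()),
]
print(evaluate_agent(run_agent, scenarios, trials=3))
```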

Five Proven Workflow Patterns For Building AI Agents

Every agent deployment faces unique challenges – some need to handle unpredictable user requests, others must coordinate multiple specialised tasks, and many require adapting to changing business requirements over time. 

However, there are still core best practices to follow – building blocks that can be combined and adapted based on your specific requirements:

  1. Prompt Chaining breaks complex tasks into sequential steps, with each LLM call handling one specific piece (a minimal sketch follows this list). This trades latency for accuracy and makes debugging considerably easier. Each step – extraction, validation, formatting – becomes a discrete, testable component. When something fails, you know precisely where to investigate.
  2. Routing utilises intelligent triage to direct requests to the most suitable model for each task. Simple queries hit fast, economical models while challenging problems receive the full power of advanced systems. Your costs decrease substantially while performance remains consistently high across different request types.
  3. Parallelisation runs independent tasks simultaneously and then combines the results. Multiple documents can be analysed concurrently, with different threads handling distinct aspects of the workload. You achieve significant throughput improvements without the coordination overhead typically associated with true multi-agent systems.
  4. Orchestration coordinates specialised agents for multi-domain challenges without the overhead of constant inter-agent communication. Different agents handle distinct specialities – diagnostic analysis, planning, administration – working together but maintaining focused independence. Each agent concentrates on its core competency.
  5. Evaluation builds feedback loops where agents assess their own work and refine approaches over time. Agents improve responses based on success metrics and resolution patterns. It creates a built-in quality control system that becomes more sophisticated with every interaction.
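As an example of the first pattern, here is a minimal prompt-chaining sketch (the call_llm helper is a hypothetical stand-in for your model API):

```python
def call_llm(prompt: str) -> str:
    # Placeholder: substitute a real model call.
    return f"[response to: {prompt[:30]}...]"

def extract(document: str) -> str:
    return call_llm(f"Extract the key fields from this document:\n{document}")

def validate(extracted: str) -> str:
    return call_llm(f"Check these fields for inconsistencies and fix them:\n{extracted}")

def format_output(validated: str) -> str:
    return call_llm(f"Format these fields as a JSON object:\n{validated}")

def process_document(document: str) -> str:
    # Prompt chaining: each step is a discrete, testable unit, so failures are
    # easy to localise, at the cost of extra latency.
    return format_output(validate(extract(document)))

print(process_document("Commercial loan agreement, 42 pages..."))
```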

Start simple, add sophistication only when you can demonstrate it’s necessary, and you’ll build systems that function when real users begin testing them.

Build Powerful AI Agents With Aya Data

Ready to build agents that work in production? Aya Data provides the expert data annotation and workflow documentation that transforms ambitious agent projects into reliable production systems. 

Our domain specialists help create training data that captures complete decision-making processes, not just labels and outcomes. 

From healthcare diagnostics to financial fraud detection, we partner with industry experts to annotate the cognitive workflows that teach agents how professionals think through uncertainty, handle edge cases, and adapt when standard approaches fail.

Contact us to discuss how our data services can accelerate your agent development and ensure your systems work when they collide with real users.