The conversation about AI “agents” has become muddied by marketing hype. Every company with a chatbot now claims to have built autonomous agents.
But strip away the noise, and we are now witnessing AI systems that can reason, plan, and act with increasing independence.
The key insight that most coverage overlooks is that agency exists on a spectrum. We’re not dealing with a binary switch from “dumb tool” to “fully autonomous agent” – there is no single category of “agentic AI.” Instead, we’re observing a gradual progression through levels of power and ability, from basic tool use to sophisticated reasoning to, potentially, consciousness.
Let’s dive into the rise of agentic AI, exploring today’s systems and what they might look like in the not-too-distant future.
The Agency Spectrum: From Tools to Consciousness
To understand where we’re headed, we need to map the territory of agentic AI.
We can break down agency in AI systems into five levels. The boundaries are admittedly arbitrary, but each level represents a meaningful jump in autonomy and in what these systems can accomplish:
Level 1: Reactive Tool Use
At this level, AI is designed to use various tools for specific tasks. In the early days of ChatGPT, people would often hand it simple math problems and laugh at its clumsy, error-prone answers, saying: “It’s only a language model – it can’t do math!”
Not today. Current-generation AI systems can delegate computation to Python environments (to solve equations or run code), call SDKs and APIs for development tasks, and search the web.
However, they don’t have persistent goals or memory between conversations (unless explicitly designed with user-enabled memory features). They respond to user instructions but do not initiate actions independently.
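The reactive pattern at this level can be sketched in a few lines. This is a hedged illustration, not any vendor’s API: the tool registry, the keyword-based routing, and the `calculator`/`echo` tools are all invented stand-ins for a real model’s function-calling step.

```python
# Minimal sketch of Level 1 "reactive tool use". The model is stubbed out:
# tool selection is a keyword match instead of a real function-calling step.
# The registry, routing rule, and both tools are invented for illustration.
from typing import Callable, Dict

TOOLS: Dict[str, Callable[[str], str]] = {
    # Toy calculator: never eval untrusted input in real code.
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
    "echo": lambda text: text,
}

def route(request: str) -> str:
    """Stand-in for the model's tool choice: pick a tool, run it, return."""
    if any(ch in request for ch in "+-*/"):
        return TOOLS["calculator"](request)
    return TOOLS["echo"](request)

print(route("17 * 3"))  # arithmetic is delegated to a tool, not "guessed"
print(route("hello"))   # no tool applies, so the request passes through
```

Note what is missing: nothing persists between calls, and nothing happens unless a request arrives – exactly the Level 1 limitations described above.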
Level 2: Goal-Directed Planning
Level 2 represents systems that can break down complex tasks into steps and maintain state across interactions.
A customer service agent who can access multiple systems, understand context, and follow multi-step resolution processes fits here. It’s bounded, but it can think ahead and pursue specific objectives.
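A minimal sketch of this level, assuming a hypothetical planner whose decomposition step is hard-coded (a real system would ask the model to produce the plan, and each step would call external systems):

```python
# Hedged sketch of Level 2 behaviour: decompose a goal into steps and keep
# state across turns. The plan contents and step handling are invented.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Agent:
    goal: str
    plan: List[str] = field(default_factory=list)
    done: List[str] = field(default_factory=list)  # state persists across steps

    def make_plan(self) -> None:
        # A real system would have the model decompose the goal here.
        self.plan = ["look up account", "check billing history", "issue refund"]

    def step(self) -> str:
        task = self.plan.pop(0)
        self.done.append(task)  # remembers what it has already completed
        return f"completed: {task}"

agent = Agent(goal="resolve billing dispute")
agent.make_plan()
while agent.plan:
    print(agent.step())
```

The key difference from Level 1 is the `done` list: the agent tracks progress toward a bounded objective instead of treating every request in isolation.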
Level 3: Multi-Domain Reasoning
This is where things get interesting. These systems can transfer knowledge across domains, reason about novel situations, and adapt strategies based on feedback.
Think of a system that learns from resolving billing disputes and applies that learning to technical support issues. The reasoning transfers, the patterns connect.
OpenAI’s Operator and its other agent products are Level 2 and Level 3 systems: they can establish goals and reason across domains to achieve them.
Level 4: Strategic Autonomy
Level 4 is a broad category that encompasses systems that set their own sub-goals, learn from long-term outcomes, and operate with minimal oversight across various domains. At the simple end, the deep research features in ChatGPT and Claude already develop plans before executing tasks in parallel to achieve a goal.
However, at the upper end, an AI could develop much more sophisticated strategies, e.g., learning how to open browsers, download various tools, and operate them together to achieve tasks such as programming an entire game from scratch.
There is a critical point of distinction here. The most advanced agentic AI tools primarily live in digital environments (e.g., as software). Companies such as DeepMind and OpenAI are, however, working on integrating their state-of-the-art models into robots with sensory capabilities that can learn open-ended tasks.
Above: The Google Gemini robotics project seeks to make semi-autonomous, agentic robots easier to develop.
Level 5: Conscious Agency
Level 5 represents theoretical systems that exhibit some of the 14 indicators of consciousness identified by researchers, ranging from recurrent processing to metacognitive monitoring to genuine agency. This includes self-awareness, the ability to model other minds, and genuine preferences rather than programmed objectives.
Most current “agentic” systems operate between Levels 1 and 2, with a few reaching Level 4 in narrow, well-defined domains.
Ultimately, though, we’re still far from true agency in the sense seen in biological organisms, especially humans, who demonstrate flexible goals, self-reflection, and rich world modelling.
Reaching something closer to human-like agency would likely require advances such as more efficient architectures (e.g., neuromorphic computing), much greater computational power, improved training data and methods, and perhaps smaller, energy-efficient systems that can operate continuously in real-world settings.
The Current State of Agentic AI
Commercial, widely available models are now breaking into the lower levels of agentic AI, led by OpenAI’s “o” series. o1 introduced “chain-of-thought” reasoning directly into the model’s processing.
Unlike previous systems that generated responses immediately, o1 pauses to think – sometimes for minutes – before answering.
The difference is like asking someone to solve a complex math problem in their head versus giving them scratch paper and time to work through it step by step. The underlying intelligence might be the same, but the ability to tackle complex, multi-step problems transforms.
The results are striking. o1 achieved 90.8% on MMLU (a comprehensive knowledge benchmark) and performed at a level that would meet Mensa admission requirements.
More importantly for agentic applications, it could maintain coherent reasoning across complex, multi-step problems without losing track of the overall goal. In tests, it discovered its own methods for solving problems, sometimes finding solutions that surprised even its creators.
The o3 Leap (December 2024)
Just three months later, OpenAI announced o3, and the performance jump was extraordinary. On the ARC-AGI benchmark, specifically designed to test genuine intelligence and adaptation to novel tasks, o3 achieved 87.5% accuracy, compared to human performance of 85%.
To put this in perspective: it took four years for AI models to progress from 0% to 5% on ARC-AGI. o3 then blew past every previous score in a matter of months.
Variable Compute: The Key Innovation
What makes the “o” series genuinely different from previous models is variable compute. o3 can be set to low, medium, or high compute modes.
The higher the compute, the longer it thinks, and the better it performs. At high compute, o3 can spend thousands of dollars worth of processing time on a single problem.
This represents a change from scaling pre-training (making models larger) to scaling inference (allowing them to think more deeply). Thus, instead of building a bigger brain, we’re teaching the existing brain to think more carefully.
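OpenAI has not published how o3 spends its extra compute, but best-of-n sampling is one well-known, simple way to trade inference compute for answer quality, and it illustrates the idea: draw more candidate answers and keep the best one. The scoring stub and the n values below are invented for illustration.

```python
# Illustrative only: best-of-n sampling as a stand-in for "variable compute".
# More samples (more compute) can only improve the best score found.
import random

def candidate_answer(rng: random.Random) -> float:
    """Stub for one model sample; returns a quality score in [0, 1]."""
    return rng.random()

def best_of_n(n: int, seed: int = 0) -> float:
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    return max(candidate_answer(rng) for _ in range(n))

for n in (1, 4, 64):  # stand-ins for "low", "medium", "high" compute modes
    print(f"n={n:2d}  best score={best_of_n(n):.3f}")
```

With a shared seed, each larger n extends the same sample sequence, so the best score is monotonically non-decreasing – a toy version of “think longer, perform better.”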
Real-World Agent Deployments
While reasoning models grab headlines, practical agent deployments reveal both the promise and current limitations of agentic AI. The gap between lab results and production reality remains significant; however, successful deployments share important characteristics.
Here are some examples of agentic AI used across industries today:
Software Development
Coding is becoming more AI-driven and agentic, with the latest tools like Cognition’s Devin and OpenAI’s o series representing the current state of the art in autonomous programming.
Launched as an “autonomous software engineer,” Devin can write code, debug applications, and even train machine learning models. It handles the entire software development lifecycle from requirements to deployment.
But the reality is more nuanced. Devin resolves only about 14% of real-world GitHub issues – roughly double the rate of standard chatbots, but far from complete autonomous operation.
The broader impact is undeniable: 97% of developers now utilise AI coding tools, representing a genuine evolution in how software is developed.
Industrial Agentic AI
Manufacturing and logistics are early beneficiaries of agentic systems that handle multi-step, cross-system workflows. For example, production planning systems detect low inventory levels, evaluate alternative suppliers based on price and lead time, generate purchase orders in ERP systems, and adjust production schedules to account for delivery delays.
Warehouse automation systems coordinate fleets of autonomous mobile robots that navigate dynamic environments, fulfil orders, and optimise packing layouts.
Siemens uses agentic AI for predictive maintenance, monitoring real-time sensor data to forecast failures and schedule interventions autonomously, reportedly reducing downtime by ~25%.
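The inventory workflow described above can be sketched as a simple decision rule: detect low stock, score suppliers on price and lead time, and emit a purchase order. All thresholds, weights, and supplier figures here are invented; a production system would pull these from an ERP.

```python
# Hypothetical sketch of the low-inventory reorder workflow. All numbers,
# names, and weights are invented for illustration.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Supplier:
    name: str
    unit_price: float
    lead_time_days: int

def pick_supplier(suppliers: List[Supplier],
                  price_weight: float = 1.0,
                  delay_weight: float = 2.0) -> Supplier:
    # Lower combined cost wins; delay_weight encodes how costly a late delivery is.
    return min(suppliers, key=lambda s: price_weight * s.unit_price
                                        + delay_weight * s.lead_time_days)

def check_and_reorder(stock: int, threshold: int,
                      suppliers: List[Supplier]) -> Optional[dict]:
    if stock >= threshold:
        return None  # inventory is fine; nothing to do
    best = pick_supplier(suppliers)
    return {"supplier": best.name, "qty": threshold - stock}  # toy purchase order

suppliers = [Supplier("A", 9.5, 12), Supplier("B", 10.0, 5)]
print(check_and_reorder(stock=40, threshold=100, suppliers=suppliers))
```

Here supplier B wins despite a higher unit price because the weighting penalises its rival’s longer lead time – the kind of trade-off the production-planning agents above make across live ERP data.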
Task-Level Automation in Healthcare
Healthcare systems increasingly deploy agent-like automation for specific administrative and clinical workflows. Insurance pre-authorisation agents can extract patient data from electronic health records (EHRs), validate coverage criteria, and submit standardised requests automatically.
Hospitals use AI to transcribe and summarise clinician–patient conversations into structured notes that meet billing requirements.
Clinical decision support agents suggest care plan updates or flag drug interactions by querying knowledge bases and patient histories. DeepMind, for example, is collaborating with BioNTech to develop lab assistants that assist scientists in planning experiments and monitoring equipment.
Scientific Research and Discovery
Laboratory automation is seeing the emergence of agentic AI systems that plan and execute entire experimental workflows. DeepMind, in collaboration with Lawrence Berkeley National Lab, built an autonomous lab that designed, synthesised, and tested dozens of new materials with minimal human intervention.
A research team from the University of Science and Technology of China has also demonstrated agentic AI that autonomously developed a catalyst from simulated Martian rock to enable oxygen production through water-splitting.

Above: Researchers from the University of Science and Technology of China developed AI agents that autonomously designed and synthesised a catalyst from simulated Martian rock. This enables water-splitting to produce oxygen on Mars for future missions. (Source: Nature)
In the pharmaceutical industry, teams have integrated robotic systems with AI planners to automate the scheduling and execution of drug discovery experiments, simulating reaction outcomes and refining hypotheses through iterative cycles.
NVIDIA’s BioNeMo has also been utilised to aid in automating protein design pipelines by predicting molecular structures and guiding laboratory work.
The Challenges of Building Reliable Agents
AI’s agentic trajectory is clear, even if the timeline remains uncertain. OpenAI’s Sam Altman calls o3 “the beginning of the next phase of AI,” focused on increasingly complex reasoning tasks.
Meta’s latest models show similar reasoning abilities, while open-source alternatives are rapidly closing the gap, bringing complex multi-step workflows to a broader audience.
However, building reliable, production-ready agents demands more than ingenious algorithms. It requires selecting the right tools and building robust foundations spanning expertly labelled data, domain-specific knowledge, and rigorous validation to ensure consistent behaviour in the real world.
At Aya Data, we see these demands daily – from medical systems that rely on pathologist-labelled tissue samples to autonomous vehicles trained on millions of annotated driving scenarios.
The path to truly reliable agentic AI runs through meticulous preparation and domain expertise.
Getting these foundations wrong can derail even the most promising projects. In Part 2, we’ll explore why these factors often determine whether agentic AI succeeds or fails in practice, and examine the challenges of building effective systems that perform reliably outside the lab.