The gap between what we call “multimodal AI” today and truly unified systems is much wider than most people realise.
While current AI does an impressive job handling text and images, or processing audio and video separately, we’re aiming for something fundamentally different – systems that can think across all human communication channels at once, integrating and generating content as naturally as we combine information from multiple senses.
This is no incremental upgrade – it’s a complete reimagining of how machines understand and interact with the world.
But achieving true multimodal unification requires solving engineering challenges that make today’s AI development look straightforward by comparison. Read on to learn how researchers are building more tightly integrated multi-modal systems that can understand a diverse array of inputs.
The Illusion of Current Multimodal AI
Before 2025, when you uploaded an image to a generative AI tool and asked a question about it, you were essentially witnessing an elaborate engineering trick.
Behind the scenes, that image gets processed by a completely separate vision system, converted into mathematical representations, and only then combined with your text through complex fusion networks.
So, while ‘multi-modal’ in a practical sense, most widely available gen AI tools today detect the type of media in the chat interface (text, image, audio) and route it to a separate subsystem to process the prompt. They’re not ‘integrated’ on the back end in the way you might imagine them to be.

Above: A history of AI technology, with multi-modal technology breaking out around 2020. Source: Wikimedia Commons.
Nevertheless, we are still very much in the era of multi-modal generative AI, with most widely available systems working across multiple modalities, from text to image and audio.
It won’t be long before video is widely available too. In research and business contexts, multi-modal AI will soon evolve to process everything from IoT sensor data and LiDAR to chemical sensing and touch (via haptic feedback).
The Modular Technology Behind Current Systems
The above is what researchers refer to as “late fusion” architecture, where each type of input – text, images, audio, or video – is first processed separately by specialised encoders before their representations are combined later in the model.
This makes it easier to swap or improve individual components and has become the standard in many real-world systems, from image–text retrieval models to video captioning pipelines.
Here’s how it works:
- Vision encoders (typically Vision Transformers or CNNs) convert images into high-dimensional vector representations
- Text encoders process language into separate mathematical embeddings using transformer architectures
- Fusion networks attempt to find relationships between these isolated representations through cross-attention mechanisms
- Output decoders generate responses based on these combined but originally separate encodings
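To make this pipeline concrete, here’s a minimal PyTorch sketch of a late-fusion setup. It’s illustrative only: the LateFusionVQA class and its dimensions are hypothetical stand-ins, with a linear layer in place of a real vision encoder and an embedding table in place of a text transformer.

```python
import torch
import torch.nn as nn

class LateFusionVQA(nn.Module):
    """Toy late-fusion model: separate encoders, combined only at the end."""

    def __init__(self, dim=256, vocab_size=1000, img_feat_dim=2048):
        super().__init__()
        # Each modality gets its own, independently designed encoder.
        self.vision_encoder = nn.Linear(img_feat_dim, dim)  # stands in for a ViT/CNN
        self.text_encoder = nn.Embedding(vocab_size, dim)   # stands in for a text transformer
        # Fusion network: text tokens attend over the already-encoded image tokens.
        self.fusion = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
        # Output decoder: generates a response from the combined representations.
        self.decoder = nn.Linear(dim, vocab_size)

    def forward(self, image_feats, question_ids):
        # 1) The image is compressed into fixed vectors before the question is seen.
        img_tokens = self.vision_encoder(image_feats)        # (B, N_img, dim)
        # 2) The question is encoded separately.
        txt_tokens = self.text_encoder(question_ids)         # (B, N_txt, dim)
        # 3) Fusion happens late: the question can only query what step 1 kept.
        fused, _ = self.fusion(query=txt_tokens, key=img_tokens, value=img_tokens)
        return self.decoder(fused)                           # per-token output logits

model = LateFusionVQA()
image_feats = torch.randn(1, 49, 2048)          # e.g. a 7x7 grid of pooled CNN features
question_ids = torch.randint(0, 1000, (1, 12))  # a 12-token question
print(model(image_feats, question_ids).shape)   # torch.Size([1, 12, 1000])
```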
The fundamental problem with this? Each type of input gets processed on its own before it’s combined.
When you upload an image and ask a question, the system first turns that image into a fixed set of numbers without knowing what you’ll ask about it. That means it has to choose in advance what details to keep, and once that encoding is done, it can’t go back and “look” for new details it didn’t save.
For example, if it didn’t bother to preserve the text on a tiny sign in the background because it didn’t think it was important, it can’t answer “What does the sign say?” later. Or if it smoothed over subtle facial cues, it can’t reliably answer “What emotion is this person showing?”.
Unlike a human who can refocus on specific parts of an image when asked a question, these systems can only work with whatever information they chose to keep the first time around.
Information Bottlenecks in Modular Design
Most late fusion or modular designs can create systematic information bottlenecks – each modality is forced to compress its raw data into fixed-size vectors before the system even knows what questions will be asked. As a result, subtle but important details can be lost.
For example, consider the timing between a speaker’s words and their facial expressions – critical for interpreting emotion and intent. When audio and video are processed entirely separately, that synchrony is discarded.
Or take analysing a video of a musical performance. A modular system might extract visual features of finger movements, audio features of the notes being played, and text from any commentary – each in isolation.
But understanding how a pianist’s finger position at one moment produces a specific harmonic at another requires reasoning about timing and relationships across these modalities simultaneously.
That level of integrated understanding is something most modular systems simply can’t achieve yet.
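As a rough illustration of what gets thrown away, here’s a small NumPy sketch with invented numbers (25 feature vectors per second for each stream). Pooling each modality over time before fusing destroys the very synchrony the musical example depends on, while an interleaved sequence at least preserves the ordering in time.

```python
import numpy as np

# Hypothetical per-timestep features for one second of a performance,
# sampled at 25 fps (video) and 25 audio windows for simplicity.
rng = np.random.default_rng(0)
video_feats = rng.normal(size=(25, 64))   # finger-position features per frame
audio_feats = rng.normal(size=(25, 64))   # harmonic features per audio window

# Late-fusion style: each modality is pooled over time BEFORE fusion,
# collapsing 25 timesteps into one fixed-size vector per modality.
video_summary = video_feats.mean(axis=0)  # (64,)  timing information is gone
audio_summary = audio_feats.mean(axis=0)  # (64,)
late_fused = np.concatenate([video_summary, audio_summary])  # (128,)

# Unified style: keep the timesteps and interleave them so a model can still
# reason about which frame co-occurred with which note.
interleaved = np.stack([video_feats, audio_feats], axis=1).reshape(50, 64)

print(late_fused.shape)   # (128,)   one vector, no notion of "when"
print(interleaved.shape)  # (50, 64) a sequence that preserves ordering in time
```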
The Native Multimodal Breakthrough
Then came GPT-4o. Unlike every previous system, GPT-4o’s neural networks were trained on images and audio simultaneously with text, creating unified representational spaces from the ground up.
The performance difference is dramatic. Previous voice interaction systems required three-model pipelines with latencies of 2.8 to 5.4 seconds.
GPT-4o integrates these functions into a single model, enabling response times averaging 0.32 seconds while preserving nuanced information, such as tone, emotion, timing, and background context, throughout the reasoning process.

Above: GPT-4o trained on multiple data modalities within the same unified neural network.
GPT-4o utilises a single neural network to process inputs and generate outputs, demonstrating what is possible with unified multi-modal architectures. It sets a new benchmark for commercial-grade multi-modal gen AI tools.
What Modalities Are AI Models Integrating Today?
Building truly unified generative AI means handling the same rich mix of signals humans use to communicate and understand the world. However, the hard bit is integrating them meaningfully.
Today’s most advanced systems handle three primary modalities with varying degrees of sophistication:
- Text processing (natural language processing) – the most mature area of machine learning development, where AI demonstrates sophisticated reasoning. Models can solve mathematical problems, generate and debug code, handle complex logical arguments, and manipulate abstract concepts using advanced transformer architectures.
- Vision and image understanding (computer vision) – now goes far beyond simple object recognition. Systems can interpret medical scans, analyse complex data visualisations, perform advanced OCR on images, understand 3D spatial relationships, and generate increasingly photorealistic images from text prompts.
- Audio processing – the newest mainstream modality, growing rapidly. Models can recognise emotional tone in speech, identify speakers in group conversations, analyse musical compositions, generate realistic synthetic speech with specific vocal characteristics, and are starting to interpret environmental sounds for richer context awareness.
These three modalities form the backbone of current multimodal AI, but they’re just the beginning of what’s possible.
Video: The Complexity Multiplier
Video processing represents an exponential leap because it introduces temporal reasoning: understanding moving images and their accompanying sounds as they unfold over time.
Current video processing typically samples individual frames and processes them as static images, fundamentally losing temporal continuity. True video understanding requires:
- Maintaining coherent representations across extended sequences
- Understanding causal relationships between actions and consequences
- Processing synchronised audio tracks as integral rather than separate components
- Grasping how scenes transition and evolve over time
The most advanced systems can analyse hour-long videos and answer questions about events spanning extended timeframes. Models like DeepMind’s Flamingo or Video-LLaMA can process extended video content by sampling frames or dividing it into clips, but they often lose subtle narrative flow, causal relationships, and the evolving context that humans naturally follow. We’re still nowhere near AI that understands video with human-like temporal fluency.
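For a sense of how coarse today’s frame-sampling approach is, here’s a hedged sketch using OpenCV. The file name is hypothetical and real pipelines vary, but the basic pattern of picking a handful of frames and ignoring everything in between is common.

```python
import cv2  # opencv-python

def sample_frames(video_path, num_frames=8):
    """Uniformly sample a handful of frames, the way many current
    video-language pipelines do, treating each frame as a standalone image."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Pick evenly spaced frame indices across the whole clip.
    indices = [int(i * total / num_frames) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames  # everything between the sampled frames is simply never seen

# Hypothetical usage: a few frames stand in for the whole clip, and the
# audio track is not touched at all, which is exactly the limitation above.
frames = sample_frames("performance.mp4", num_frames=8)
print(f"Kept {len(frames)} frames; the temporal detail between them is lost.")
```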
What’s Next For Multi-Modal AI?
Emerging modalities will take AI far beyond just text, images, and audio. Research teams are pushing into entirely new sensory territories that will define the next generation of unified systems:
- 3D spatial data: LiDAR point clouds, depth sensor information, spatial scene reconstruction, and volumetric understanding
- Sensor networks: IoT device telemetry, environmental monitoring data, industrial sensor readings, and other real-time measurement streams
- Biometric signals: Heart rate variability, brain activity patterns, eye tracking data, galvanic skin response, and physiological stress indicators
- Haptic information: Touch pressure, texture recognition, temperature sensing, and tactile feedback processing
- Chemical sensors: Olfactory data, taste analysis, air quality measurements, and molecular composition detection
To build AI systems that can handle many types of data (text, images, audio, and sensor signals) without the compute costs spiralling, researchers are turning to architectures such as multimodal transformers and mixture-of-experts models.
All of this depends on some key technical advancements that make unified processing possible in the first place.
One of the most important foundations is the way these systems turn different types of data into a common mathematical form that they can reason over. That begins with developing methods for unified tokenisation.
Unified Tokenisation: The Mathematical Foundation of Multi-Modal AI
A critical step toward creating unified multimodal AI is developing tokenisation schemes that turn all types of data into a shared mathematical space.
Instead of running text, images, audio, and sensor data through totally separate encoders, these models aim to represent them as the same kind of tokens that can move through shared transformer layers. This allows the model to learn connections between modalities from the ground up, rather than stitching them together at the end.
Researchers are exploring several strategies:
- Vector quantisation – turns continuous signals like audio or images into a fixed set of discrete codes, making unstructured data “language-like” so transformers can process it consistently. It’s critical to compress high-dimensional inputs while retaining important features.
- Patch-based representations – breaks data like images, video, or even audio spectrograms into uniform patches or segments. Treating all modalities as sequences of patches enables transformers to handle them with the same architecture, making cross-modal attention straightforward.
- Shared embedding spaces – trains encoders so that different modalities land in the same high-dimensional space, letting the model learn semantic relationships between, say, a spoken phrase and a visual scene without modality barriers.
- Learned joint tokenisers – builds vocabularies that work across text, images, and audio, so all inputs use a single, consistent symbolic representation from the start.
- Multimodal autoencoding – reconstructs multiple modalities jointly during training to force the model to capture shared structure and align concepts across input types.

Above: Unified tokenisation is vital for creating multimodal systems that understand different data types simultaneously and continuously.
These strategies are pushing us toward models that don’t just accept multiple modalities, but actually understand and reason across them natively, supporting richer, more human-like integration of sensory information.
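Here’s a minimal PyTorch sketch of the patch-based idea, with illustrative sizes and no claim to match any particular model: an image and an audio spectrogram are both cut into patches, projected into the same 256-dimensional token space, tagged with a learned modality marker, and pushed through one shared transformer.

```python
import torch
import torch.nn as nn

DIM = 256  # shared token width for every modality

# Patch embeddings: a strided conv is the standard trick for cutting a 2D
# signal into non-overlapping patches and projecting each one to DIM.
image_patcher = nn.Conv2d(3, DIM, kernel_size=16, stride=16)  # RGB image -> 16x16 patches
audio_patcher = nn.Conv2d(1, DIM, kernel_size=16, stride=16)  # spectrogram -> 16x16 patches

# Learned "which modality is this?" markers added to every token.
modality_embed = nn.Embedding(2, DIM)  # 0 = image, 1 = audio

# One shared transformer processes the combined token sequence.
shared_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True),
    num_layers=2,
)

def tokenize(x, patcher, modality_id):
    tokens = patcher(x)                         # (B, DIM, H', W')
    tokens = tokens.flatten(2).transpose(1, 2)  # (B, num_patches, DIM)
    return tokens + modality_embed(torch.tensor(modality_id))

image = torch.randn(1, 3, 224, 224)        # an RGB image
spectrogram = torch.randn(1, 1, 128, 256)  # a log-mel spectrogram

img_tokens = tokenize(image, image_patcher, 0)        # (1, 196, 256)
aud_tokens = tokenize(spectrogram, audio_patcher, 1)  # (1, 128, 256)

# Because both modalities are now "just tokens", they share one sequence and
# one set of attention layers, with no separate fusion stage bolted on.
unified = shared_encoder(torch.cat([img_tokens, aud_tokens], dim=1))
print(unified.shape)  # torch.Size([1, 324, 256])
```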
Cross-Modal Attention at Scale
Today’s cross-attention mechanisms can link two modalities fairly well, such as matching a question to an image, but scaling this up to handle many modalities simultaneously is a significant challenge.
Attention cost grows quadratically with the combined sequence length, and the number of cross-modal pairings climbs quickly as modalities are added, making joint attention increasingly expensive and harder to use in real-world systems. Imagine processing a minute of video that includes synchronised audio, subtitles, depth maps, and biometric data all at once: the compute demands skyrocket.
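A back-of-envelope sketch makes the scaling visible. The per-modality token counts below are invented purely for illustration; the quadratic growth is the point.

```python
# Back-of-envelope: self-attention cost scales with (sequence length)^2,
# so every extra modality stream inflates the cost of attending over the
# whole input. Token counts are illustrative assumptions for one minute.
tokens_per_minute = {
    "video frames": 60 * 4 * 256,  # 4 sampled frames/sec, 256 patch tokens each
    "audio": 60 * 50,              # ~50 audio tokens per second
    "subtitles": 200,              # a minute of dialogue
    "depth maps": 60 * 256,        # one depth frame per second, patchified
    "biometrics": 60 * 10,         # 10 sensor readings per second
}

text_only = tokens_per_minute["subtitles"]
everything = sum(tokens_per_minute.values())

print(f"Text-only sequence:   {text_only:>7,} tokens -> {text_only**2:>17,} attention pairs")
print(f"All modalities fused: {everything:>7,} tokens -> {everything**2:>17,} attention pairs")
print(f"Roughly {everything**2 // text_only**2:,}x the attention cost of text alone")
```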
To address this, research is exploring solutions like:
- Sparse attention patterns: instead of attending to every pair of tokens, the model learns to focus only on relevant cross-modal links, saving compute and improving interpretability.
- Hierarchical processing: models broad, high-level relationships first before diving into detailed connections, making the reasoning process more scalable and efficient.
- Mixture-of-experts architectures: activate only the relevant parts of a massive network for specific modality combinations, keeping large models manageable and allowing specialisation within subcomponents.
- Routing and gating mechanisms: dynamically decide which modalities need to be fused for a given task, avoiding unnecessary computation and enabling context-dependent reasoning.
- Modular integration layers: flexible parts of the network that can combine information from any subset of modalities without retraining the entire model for new combinations.
These solutions help keep computational costs manageable as AI systems process multiple types of data simultaneously. By focusing only on the important connections and using smart ways to organise and route information, future models can work faster and smarter. This will be essential for making multimodal AI practical and effective in real-world situations.
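To show the routing idea in miniature, here’s a toy PyTorch mixture-of-experts layer. The ModalityRouter name and sizes are invented for illustration: a small gate scores the available experts for each token and only the top-scoring ones actually run.

```python
import torch
import torch.nn as nn

class ModalityRouter(nn.Module):
    """Toy mixture-of-experts layer: a gate picks which expert sub-networks
    run for each token, so most of the model stays idle for any given input."""

    def __init__(self, dim=256, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim))
             for _ in range(num_experts)]
        )
        self.gate = nn.Linear(dim, num_experts)  # scores each expert per token
        self.top_k = top_k

    def forward(self, x):                        # x: (num_tokens, dim)
        scores = self.gate(x)                    # (num_tokens, num_experts)
        weights, chosen = scores.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        # Only the chosen experts run; the rest contribute nothing (and, in a
        # real system, cost nothing).
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Usage: image tokens and audio tokens can end up routed to different experts.
layer = ModalityRouter()
mixed_tokens = torch.randn(10, 256)   # e.g. 6 image tokens + 4 audio tokens
print(layer(mixed_tokens).shape)      # torch.Size([10, 256])
```

In larger systems the unselected experts are never computed at all, which is the property that keeps very large multimodal models affordable to run.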
Recent Breakthroughs in Multi-Modal Gen AI
The transition from theory to reality is accelerating. Google DeepMind’s experimental web-browsing agent, called Mariner, can autonomously search the web, select products, and adapt when encountering problems – for example, determining which type of flour to choose when multiple options are available.
OpenAI has introduced Operator, an agent that can use its own browser to carry out multi-step tasks described in natural language, such as booking appointments, filling in forms, or ordering products online, and then hand control back to the user when needed.
We’re also seeing practical agent systems emerging in customer support and sales automation, such as Intercom’s Fin and HubSpot’s ChatSpot, which use large language models to handle complex customer queries, summarise conversations, and even automate CRM updates.
In software development, GitHub Copilot Workspace aims to act as an AI pair programmer that can plan tasks, write code, and refactor entire projects through natural-language collaboration, demonstrating how agentic AI can move beyond answering questions to executing real work.
Meanwhile, research prototypes like Voyager demonstrate agents capable of continuous learning and planning in open-ended environments, such as exploring and crafting in Minecraft, by remembering goals, trying different strategies, and refining their skills over time.
These examples represent early but meaningful steps toward AI that doesn’t just respond passively to inputs, but can plan, adapt, and act autonomously in real-world settings.
The Engineering Challenges Blocking the Future
Training unified systems requires datasets where all modalities are perfectly synchronised, and creating these at scale is extraordinarily difficult.
The process of combining different modalities, such as text, images, and speech, is not always straightforward. Each modality has specific characteristics that complicate processing and synthesis.
The challenge isn’t just technical – it’s logistical. Imagine creating a dataset where every frame of video syncs precisely with corresponding audio, text annotations, sensor readings, and metadata across millions of examples. A single misaligned example can degrade model performance across thousands of related training instances.
Combining data from multiple sources poses significant challenges due to differences in data format, timing, and interpretation. The scale required for robust, unified systems amplifies these problems exponentially.
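In practice, teams end up writing validation passes like the hedged sketch below. The field names and the 40 ms tolerance are illustrative assumptions, but rejecting samples whose streams drift apart is the core of the logistical work.

```python
# A tiny sketch of the kind of alignment check a multimodal dataset pipeline
# needs. Field names and the 40 ms tolerance are illustrative assumptions.
MAX_DRIFT_S = 0.040  # roughly one video frame at 25 fps

def check_alignment(sample):
    """Flag any modality whose timestamps drift too far from the video track."""
    video_ts = sample["video"]["timestamps"]
    problems = []
    for name in ("audio", "captions", "sensors"):
        ts = sample[name]["timestamps"]
        if len(ts) != len(video_ts):
            problems.append(f"{name}: {len(ts)} segments vs {len(video_ts)} frames")
            continue
        drift = max(abs(a - b) for a, b in zip(ts, video_ts))
        if drift > MAX_DRIFT_S:
            problems.append(f"{name}: max drift {drift * 1000:.0f} ms")
    return problems

# Example: the sensor track lags by 60 ms, enough to poison temporal labels.
sample = {
    "video":    {"timestamps": [0.00, 0.04, 0.08, 0.12]},
    "audio":    {"timestamps": [0.00, 0.04, 0.08, 0.12]},
    "captions": {"timestamps": [0.00, 0.04, 0.08, 0.12]},
    "sensors":  {"timestamps": [0.06, 0.10, 0.14, 0.18]},
}
print(check_alignment(sample))  # ['sensors: max drift 60 ms']
```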
Computational Resource Explosion
Multimodal AI systems require substantial computational resources to process and analyse large volumes of data from multiple modalities. Unified systems amplify this challenge dramatically.
Current estimates suggest truly unified systems will require:
- Training compute: 10-100x more FLOPs than current large language models due to cross-modal attention complexity
- Memory requirements: Massive high-bandwidth memory for processing long sequences across multiple modalities simultaneously
- Inference costs: Real-time processing demands dedicated hardware architectures optimised for multimodal computation
This isn’t just an engineering challenge – it’s an economic one. Only organisations with enormous computational resources will be able to train the most advanced unified systems, potentially concentrating this transformative technology among a few major players.
The Evaluation Puzzle
How do you benchmark a system that can process any combination of modalities and generate coordinated responses across all of them? Current AI evaluation focuses on specific tasks within single modalities; however, unified systems require entirely new measurement frameworks.
The research community is developing cross-modal reasoning benchmarks that test genuine integration rather than modular processing, as well as creative coordination evaluations that measure consistency across output modalities.
Additionally, real-world task performance metrics are being developed that mirror actual human work, rather than artificial laboratory tasks. Until we can properly measure progress, it’s difficult to know when we’ve achieved true multimodal unification.
Building The Foundation For True Multi-Modal Gen AI
The race toward fully unified multimodal Gen AI systems will ultimately be won by whoever can crack the data problem.
Technical architectures and computing power matter hugely, but every real breakthrough depends on having training datasets where all modalities are perfectly synchronised, semantically aligned, and thoroughly annotated at an unprecedented scale.
This is likely the most complex data challenge in the history of AI. The teams that master multimodal data annotation, temporal alignment, and semantic consistency across dozens of modality types will be the ones that unlock the next generation of AI breakthroughs.
At Aya Data, we’re building the expertise and infrastructure to support this shift. Our teams combine deep domain knowledge with rigorous processes optimised specifically for the multimodal datasets that will power tomorrow’s unified AI systems.
Whether you’re developing next-generation models or aiming for agentic AI, your success will depend entirely on the quality of the data your systems learn from.
Contact us to discover how we can help establish the training data foundation your breakthrough AI project requires.