Two annotators examine a medical scan, annotate their findings, and compare results that often diverge. When the resulting disagreement score lands in a QA report, the dataset either clears or fails a threshold, and the underlying disagreement data is discarded. Yet that discarded data is the most useful signal in the entire process.
The best medical AI teams recognise this. Rather than waiting until annotation is complete, they use inter-annotator agreement in real time, treating disagreement as a quality indicator that exposes where the system is actually failing.
When medical annotators disagree, the problem is rarely the annotators themselves. Instead, it’s usually one of four breakdowns: unclear guidelines, ambiguous definitions of pathology, inconsistent reviewer interpretations, or inadequate escalation workflows. In other words, inter-annotator agreement measures whether your annotation system has given reviewers the clarity and structure they need to perform consistently.
This article explores how to operationalise inter-annotator agreement as a governance discipline and quality signal in medical AI. While most teams treat disagreement as a pass-fail threshold in a QA report, we’ll discuss how to use it as a continuous source of improvement.
What Is Inter-Annotator Agreement and Why Does It Matter?
Inter-annotator agreement measures the extent to which multiple annotators apply the same labels to the same data. In medical image annotation, it’s commonly used to assess consistency among radiologists, pathologists, clinicians, and trained medical annotators working on the same projects.
When multiple qualified reviewers reach similar conclusions on the same annotation task, confidence in the resulting labels generally increases. If they frequently disagree, it typically indicates a problem with the annotation process itself, the guidelines provided, or the nature of the task.
Several metrics quantify inter-annotator agreement, with each suited to different annotation scenarios. Percentage agreement is the simplest approach, measuring how often annotators reach the same conclusion, but it overestimates consistency because it doesn’t account for chance agreement. Cohen’s Kappa corrects this by accounting for chance agreement, providing a more meaningful assessment of consistency, whilst Fleiss’ Kappa extends this principle to projects with more than two annotators.
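To make the difference between these metrics concrete, here is a minimal sketch in plain Python. The function names and the example labels are illustrative assumptions rather than part of any particular annotation platform, and the code assumes every item is labelled by every annotator.

```python
from collections import Counter

def percent_agreement(a, b):
    """Share of items on which two annotators chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's Kappa for two annotators: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = percent_agreement(a, b)
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement: both annotators pick the same label independently.
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    # Undefined (division by zero) if both annotators use one single label.
    return (p_o - p_e) / (1 - p_e)

def fleiss_kappa(ratings):
    """Fleiss' Kappa; ratings[i] holds the labels all raters gave item i."""
    n_items = len(ratings)
    n_raters = len(ratings[0])  # assumes the same rater count for every item
    counts = [Counter(item) for item in ratings]
    # Per-item agreement: share of rater pairs that chose the same label.
    p_items = [
        sum(c * (c - 1) for c in item.values()) / (n_raters * (n_raters - 1))
        for item in counts
    ]
    p_bar = sum(p_items) / n_items
    totals = Counter()
    for item in counts:
        totals.update(item)
    # Chance agreement from the overall share of each label.
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    return (p_bar - p_e) / (1 - p_e)

# Invented labels for 10 scans, most of them "normal".
reader_1 = ["normal"] * 8 + ["nodule", "normal"]
reader_2 = ["normal"] * 8 + ["normal", "nodule"]
print(percent_agreement(reader_1, reader_2))
print(round(cohens_kappa(reader_1, reader_2), 3))
```

On this invented sample, percentage agreement is 0.8 whilst Cohen’s Kappa is about -0.11, because the two readers never agree on the rare “nodule” label and almost all of their agreement is what chance alone would produce. In practice, teams may prefer an established implementation such as scikit-learn’s `cohen_kappa_score` over hand-written code.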
Yet the numerical score often obscures what matters most. A dataset that achieves a Cohen’s Kappa of 0.85 may proceed without further investigation, whereas another project with a score of 0.62 may warrant serious concern. The score itself tells an incomplete story.

Where Annotator Disagreement Actually Comes From
Annotator disagreement is usually a signal that something within the annotation system deserves closer examination. For example, when annotators working on a chest X-ray dataset disagree on whether a finding constitutes “mild” or “moderate” cardiomegaly, that is not necessarily a labelling failure. Rather than reflecting annotator incompetence, disagreement often reveals a combination of clinical complexity, guideline ambiguity, tooling limitations, and gaps in reviewer interpretation.
1) Clinical Ambiguity
Many medical findings exist on a spectrum rather than as clear binary classifications, particularly in radiology and pathology workflows. A subtle lung nodule, an early-stage lesion, or an irregular tissue boundary may be interpreted differently by equally qualified reviewers, each applying valid but slightly different clinical reasoning to the same image.
2) Guideline Ambiguity
Beyond clinical interpretation, the instructions guiding that interpretation can also be a source of confusion. An annotation guideline may specify that reviewers should label tumour boundaries but fail to define how to handle peripheral tissue involvement. Disagreement that arises from such a gap points to weaknesses in the annotation protocol itself rather than to annotator performance.
3) Reviewer Calibration
As annotation projects scale, small differences in interpretation can compound across thousands of images. One reviewer may gradually become more conservative in their assessment, whilst another adopts a broader interpretation of inclusion criteria. Over time, these calibration drifts can create systematic disagreement unrelated to the quality of the annotators involved.
4) Tooling and Workflow Limitations
The annotation environment itself can also contribute to disagreement. Consider a CT study where one reviewer examines a lesion using an optimised lung window whilst another uses a soft-tissue window; the visibility of the same finding may differ significantly between viewing configurations, resulting in inconsistent annotations despite both reviewers following the same protocol.
5) Edge Cases and Rare Findings
Some annotation projects encounter cases that the original annotation guidelines never anticipated. Rare pathologies, poor-quality scans, imaging artefacts, incomplete studies, and unusual anatomical presentations all fall into this category.

Disagreements become useful when annotation teams treat each one as evidence rather than as a failure. Every disagreement contains information about how the annotation system is functioning, revealing weaknesses in reviewer calibration, limitations in the tooling, and flaws in specific annotation guidelines.
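One simple way to start reading disagreement as evidence is to count which label pairs annotators confuse most often. The sketch below is a hypothetical example with invented cardiomegaly grades; the function name `disagreement_pairs` is an assumption for illustration.

```python
from collections import Counter

def disagreement_pairs(labels_a, labels_b):
    """Count each unordered pair of conflicting labels from two annotators."""
    pairs = Counter()
    for a, b in zip(labels_a, labels_b):
        if a != b:
            # Sort so ("mild", "moderate") and ("moderate", "mild") merge.
            pairs[tuple(sorted((a, b)))] += 1
    return pairs

# Invented grades for seven chest X-rays, not real data.
reader_1 = ["none", "mild", "mild", "moderate", "mild", "severe", "none"]
reader_2 = ["none", "moderate", "mild", "mild", "moderate", "severe", "mild"]

pairs = disagreement_pairs(reader_1, reader_2)
for pair, count in pairs.most_common():
    print(pair, count)
```

Here three of the four disagreements sit on the mild versus moderate boundary, which would suggest reviewing how the guideline defines that boundary before questioning either reader.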
Why High Agreement Does Not Always Mean High Dataset Quality
High agreement often indicates that annotators are applying the same labels consistently. Yet consistency does not necessarily mean they are applying the correct labels or capturing the nuances required for that specific AI application. A dataset where annotators agree perfectly can still fail to reflect clinical reality.
If annotation guidelines contain definitions based on incorrect assumptions, annotators who follow those flawed instructions faithfully may still achieve high agreement. For example, an annotation protocol may define lesion boundaries too narrowly or exclude clinically relevant surrounding tissue. If every annotator consistently follows those instructions, agreement scores may remain high even if the resulting labels fail to reflect the reality that the clinical model will encounter.
Agreement around incomplete ontologies adds another dimension to this problem. Medical AI projects often rely on predefined annotation schemas that determine which findings should be labelled and how they should be categorised. When the ontology itself misses important clinical variations, annotators consistently apply broad labels to complex findings, and the resulting strong agreement masks the loss of clinically relevant information.
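The gap between consistency and correctness can be shown with a small sketch. It assumes a hypothetical expert reference standard exists for a sample of cases; all labels below are invented.

```python
def agreement_rate(a, b):
    """Share of items on which two label lists match."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Both annotators follow the same (flawed) guideline, so they match exactly.
annotator_1 = ["no_lesion", "no_lesion", "lesion", "no_lesion", "no_lesion"]
annotator_2 = list(annotator_1)

# Hypothetical expert reference for the same five cases.
expert_reference = ["no_lesion", "lesion", "lesion", "lesion", "no_lesion"]

between_annotators = agreement_rate(annotator_1, annotator_2)
against_reference = agreement_rate(annotator_1, expert_reference)
print(between_annotators, against_reference)
```

The annotators agree on every case, yet they match the reference on only three of five. Agreement between annotators says nothing about this second number, which is why a periodic comparison against expert-reviewed cases is a useful complement.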
How Medical AI Teams Operationalise Inter-Annotator Agreement
Mature medical AI teams view IAA as a continuous feedback mechanism that improves annotation workflows, strengthens dataset governance, and ultimately produces more reliable training data. The best outcomes often stem from the operational insight generated by the disagreement behind the score.

1) Using Disagreement to Validate Annotation Guidelines
Repeated disagreement on the same annotation often indicates that the guidelines leave room for interpretation. Rather than forcing agreement through repeated corrections, mature annotation teams investigate these patterns of disagreement and update guidelines accordingly. This cycle of discovery and refinement prevents similar disagreements from recurring across the broader dataset.
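As a rough sketch of how such patterns might be surfaced, the example below groups double-annotated cases by the guideline section they fall under and flags sections with a high disagreement rate. The record fields (`section`, `label_a`, `label_b`) and the thresholds are assumptions chosen for illustration.

```python
from collections import defaultdict

def flag_sections(records, threshold=0.3, min_cases=3):
    """Return {section: disagreement rate} for sections at or above threshold."""
    totals = defaultdict(int)
    conflicts = defaultdict(int)
    for rec in records:
        totals[rec["section"]] += 1
        if rec["label_a"] != rec["label_b"]:
            conflicts[rec["section"]] += 1
    flagged = {}
    for section, total in totals.items():
        rate = conflicts[section] / total
        # Ignore sections with too few cases to say anything meaningful.
        if total >= min_cases and rate >= threshold:
            flagged[section] = rate
    return flagged

# Invented records: each case is tagged with a hypothetical guideline section.
records = [
    {"section": "tumour_boundary", "label_a": "include", "label_b": "exclude"},
    {"section": "tumour_boundary", "label_a": "include", "label_b": "exclude"},
    {"section": "tumour_boundary", "label_a": "include", "label_b": "include"},
    {"section": "tumour_boundary", "label_a": "exclude", "label_b": "include"},
    {"section": "laterality", "label_a": "left", "label_b": "left"},
    {"section": "laterality", "label_a": "right", "label_b": "right"},
    {"section": "laterality", "label_a": "left", "label_b": "left"},
]
flagged = flag_sections(records)
print(flagged)
```

A section that keeps appearing in this output is a candidate for a guideline revision rather than for another round of individual corrections.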
2) Reviewer Calibration Gaps
Guidelines may be clear, but reviewers may apply them differently, particularly in long-running projects where annotators gradually develop their own interpretation of inclusion criteria, pathology definitions, or segmentation rules. To identify where calibration begins to drift before inconsistencies spread throughout the dataset, teams introduce targeted review sessions, consensus meetings, and retraining exercises. These interventions catch drift early rather than allowing it to compound across thousands of images.
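One possible way to watch for this kind of drift is to have reviewers label the same shared calibration cases in each batch and compare how often each marks a finding. The sketch below assumes binary labels and a hypothetical tolerance of 0.2; both are illustrative choices.

```python
def positive_rate(labels, positive="finding"):
    """Share of cases a reviewer marked as positive."""
    return sum(label == positive for label in labels) / len(labels)

def drifting_batches(batches, tolerance=0.2):
    """Return (batch index, rate gap) where two reviewers diverge.

    Each batch is a pair of label lists for the SAME shared cases, so a
    gap reflects reviewer behaviour rather than a change in case mix.
    """
    flagged = []
    for index, (labels_a, labels_b) in enumerate(batches):
        gap = positive_rate(labels_a) - positive_rate(labels_b)
        if abs(gap) > tolerance:
            flagged.append((index, gap))
    return flagged

F, N = "finding", "normal"
# Invented calibration batches: reviewer B grows more conservative over time.
batches = [
    ([F, F, N, N], [F, F, N, N]),
    ([F, F, N, N], [F, N, N, N]),
    ([F, F, F, N], [F, N, N, N]),
]
drift = drifting_batches(batches)
print(drift)
```

A widening gap across consecutive batches would be a reasonable trigger for the consensus meetings and retraining exercises described above.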
3) Specialist Escalation Pathways
Some findings are inherently ambiguous and cannot be resolved in a single pass. Disagreements involving subtle pathology, uncertain lesion boundaries, or rare disease presentations may be routed to senior radiologists or specialist adjudicators for a second opinion. Over time, these escalated cases become learning assets that improve both reviewer performance and annotation protocols.
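An escalation pathway can be expressed as a small routing rule. The sketch below is one hypothetical policy, not a recommended standard: the route names, the `rare_finding` flag, and the 0.75 consensus threshold are all assumptions.

```python
from collections import Counter

def route_case(votes, rare_finding=False, consensus_share=0.75):
    """Pick a review route from the labels several annotators assigned."""
    _, top_count = Counter(votes).most_common(1)[0]
    share = top_count / len(votes)  # share of votes for the leading label
    if rare_finding:
        # Rare presentations go to a specialist regardless of agreement.
        return "specialist_adjudication"
    if share == 1.0:
        return "accept"
    if share >= consensus_share:
        return "consensus_review"
    return "specialist_adjudication"

print(route_case(["nodule", "nodule", "nodule"]))
print(route_case(["nodule", "nodule", "nodule", "normal"]))
print(route_case(["nodule", "normal", "artefact"]))
```

Keeping the rule explicit makes it easier to audit why a case was escalated and to collect the escalated cases later as training material.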
4) Improving Ontology Design
If reviewers repeatedly struggle to classify certain findings, it may indicate that the available label categories do not adequately represent the clinical reality within the data. This limitation is common in medical imaging projects where disease presentation exists on a continuum rather than within neatly separated categories. To prevent these weaknesses from affecting model training, teams analyse trends in disagreement, refine label taxonomies, and introduce additional classifications.
5) Quality Assurance
One of the most practical uses of inter-annotator agreement is helping teams focus quality assurance where it is needed most. Cases with consistently high agreement may require only routine validation, whilst those with recurring disagreement may warrant deeper investigation, additional review cycles, or specialist adjudication. This targeted approach to QA ensures resources are deployed effectively.
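A minimal sketch of this prioritisation, assuming each case has labels from several annotators, ranks cases by the entropy of their votes so the most contested cases are reviewed first. The case identifiers and labels are invented, and entropy is only one of several possible disagreement scores.

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Shannon entropy (bits) of the label votes; 0 means full agreement."""
    n = len(votes)
    return -sum((c / n) * math.log2(c / n) for c in Counter(votes).values())

def qa_priority(cases):
    """Return case ids ordered from most to least contested."""
    return sorted(cases, key=lambda case_id: vote_entropy(cases[case_id]),
                  reverse=True)

# Invented votes from three annotators per scan.
cases = {
    "scan_001": ["normal", "normal", "normal"],
    "scan_002": ["nodule", "normal", "artefact"],
    "scan_003": ["nodule", "nodule", "normal"],
}
queue = qa_priority(cases)
print(queue)
```

The three-way split on `scan_002` puts it at the front of the queue, whilst the unanimous `scan_001` falls to routine validation.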
6) Monitoring Dataset Quality
Agreement trends can reveal emerging issues long before they become visible in model performance metrics. When tracked over time, IAA serves as an early warning system for annotation quality and dataset stability.
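As an illustration of such an early warning, the sketch below takes a list of per-batch agreement scores (for example Kappa values computed elsewhere) and reports the first batch at which a rolling mean falls noticeably below the early baseline. The window size, baseline length, and allowed drop are hypothetical settings, and the scores are invented.

```python
def rolling_mean(values, window):
    """Mean of each run of `window` consecutive values."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]

def first_alert(scores, window=3, baseline_batches=3, max_drop=0.1):
    """Index of the first batch whose rolling mean drops too far, else None."""
    baseline = sum(scores[:baseline_batches]) / baseline_batches
    for offset, mean in enumerate(rolling_mean(scores, window)):
        if baseline - mean > max_drop:
            # Report the batch that closes the offending window.
            return offset + window - 1
    return None

# Invented per-batch agreement scores showing a gradual decline.
scores = [0.82, 0.80, 0.81, 0.78, 0.72, 0.66, 0.60]
alert_batch = first_alert(scores)
print(alert_batch)
```

A rolling mean smooths out a single noisy batch, so an alert here points to a sustained decline rather than a one-off dip.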
7) Governance Mechanism
At the basic level, IAA measures consistency, whilst at an advanced level, it helps organisations improve the entire annotation pipeline. In mature teams, it is used continuously to improve the systems that create dataset quality.
Conclusion
The value of IAA lies in what the score reveals about the annotation system. While IAA is often treated as a quality assurance metric that helps teams measure annotation consistency, the disagreement behind the score shows where annotation processes break down, not merely whether they work.
Medical AI advancement brings new constraints: regulatory alignment, data anonymity and patient protection. Organisations that treat inter-annotator agreement as a governance signal rather than a reporting metric will build clinically meaningful and operationally scalable datasets. This distinction becomes critical as AI deployment scales.
At Aya Data, we help healthcare AI teams build high-quality medical imaging datasets through expert-led annotation workflows, structured quality assurance processes, and governance frameworks designed for real-world AI deployment. To learn more about medical image annotation, check our service offering or schedule a 15-minute discovery call with one of our experts.
Frequently Asked Questions
What is inter-annotator agreement?
Inter-annotator agreement (IAA) measures how consistently multiple annotators apply labels to medical images, evaluating whether radiologists, pathologists, or trained annotators interpret and annotate in the same way.
What is a good inter-annotator agreement score?
There is no universal threshold. Kappa-based scores equal 1 when annotators agree perfectly and fall to 0 when agreement is no better than chance, so higher values indicate stronger consistency. What counts as acceptable depends on the annotation task and how much clinical ambiguity it involves. The score also tells an incomplete story, so review the pattern of disagreement behind it rather than relying on the number alone.
What is Cohen’s Kappa in inter-annotator agreement?
Cohen’s Kappa measures agreement between two annotators while accounting for chance agreement, making it one of the most reliable inter-annotator agreement metrics in medical AI.
Why do annotators disagree on medical images?
Disagreement arises from clinical ambiguity, unclear guidelines, reviewer calibration drift, tooling limitations, and rare pathology presentations. Rather than indicating poor annotator performance, it typically highlights where the annotation process needs improvement.
Why is inter-annotator agreement important in medical AI?
Inter-annotator agreement identifies weaknesses in guidelines, reviewer calibration gaps, and dataset quality issues before they affect model performance. When used as a governance signal, it strengthens training data and prevents poor annotation from degrading downstream models.
Can high inter-annotator agreement still result in poor-quality datasets?
Yes, high agreement indicates consistency but not correctness; annotators may consistently apply flawed guidelines or incomplete ontologies. Use inter-annotator agreement alongside expert review and quality assurance rather than as a standalone quality measure.
Written by

Head, Data Annotation
Akhil Singh is the Head of Business Unit, Annotation, at Aya Data and a seasoned AI and GTM leader. With over a decade of experience scaling B2B sales and operations across AI, data, and SaaS, he combines commercial discipline with deep domain expertise to drive growth across computer vision, GenAI, and Medical AI.
