What Is Inter-Rater Reliability? | Definition & Examples
Inter-rater reliability is the degree of agreement or consistency between two or more raters evaluating the same phenomenon, behavior, or data.
In research, it plays a crucial role in ensuring that findings are objective, reproducible, and not dependent on a single observer’s judgment. Whether you’re analyzing questionnaire responses, coding interview transcripts, or assessing behavioral outcomes, strong inter-rater reliability boosts the credibility of your results.
The method to calculate inter-rater reliability depends on the type of data (e.g., nominal, ordinal, interval, or ratio) and the number of raters. The most common methods include Cohen’s kappa (for two raters), Fleiss’ kappa (for three or more raters), and the Intraclass Correlation Coefficient (for continuous data).
For example, suppose two researchers code the same set of interview transcripts during a thematic analysis. Each researcher independently reads the transcripts and assigns codes to relevant passages. If the coders consistently label the same responses with the same themes, the study demonstrates high inter-rater reliability. This consistency ensures that the findings reflect real patterns in the participants’ experiences rather than the subjective interpretations or biases of individual researchers, which helps reduce observer bias.
Inter-rater reliability examples
Inter-rater reliability is a critical concept across many fields because it ensures that measurements and observations are consistent and trustworthy, regardless of who is conducting the assessment.
By using reliable coding, scoring, or evaluation methods, researchers can reduce subjective bias and strengthen the validity of their findings.
Inter-rater reliability in psychology example
Suppose two trained observers independently rate participants’ behavior during an experiment. By calculating inter-rater reliability, the researchers ensure that the behavioral ratings are consistent across observers and not influenced by individual bias. High inter-rater reliability supports the internal validity of the study, confirming that observed differences truly reflect the experimental conditions rather than the observers’ subjective judgments.
Inter-rater reliability in healthcare example
Suppose, for example, that nurses rate postoperative patients’ recovery (e.g., pain levels) on a standardized scale. Multiple nurses independently score each patient, and inter-rater reliability is calculated to ensure that the ratings are consistent across observers. This consistency is crucial for monitoring recovery trends, guiding treatment decisions, and comparing outcomes across different hospitals. It also supports the study’s external validity, making it easier to generalize findings to the broader population of postoperative patients.
Inter- vs intra-rater reliability
When discussing reliability, it’s important to distinguish between inter-rater and intra-rater reliability, as both play crucial roles in ensuring accurate and consistent measurements in research.
- Inter-rater reliability measures the consistency of ratings across different observers assessing the same phenomenon. It ensures that findings are objective and reproducible, regardless of who collects the data.
- Intra-rater reliability measures the consistency of a single observer when assessing the same phenomenon multiple times. It ensures that an individual rater’s measurements are stable and not influenced by fatigue, memory, or subjective shifts.
Inter-rater reliability
Three different nurses independently assess the pain levels of the same group of 20 patients at the same time, using the same rating scale. The researchers calculate the Intraclass Correlation Coefficient (ICC) to determine how consistently the different nurses rate each patient. High inter-rater reliability indicates that the pain scores are not dependent on which nurse is observing.
Intra-rater reliability
One of the nurses re-evaluates the same patients’ pain levels two days later without referencing the previous scores. The researcher calculates ICC to see if this nurse’s ratings are consistent over time. High intra-rater reliability indicates that the nurse’s scoring is stable.
How to calculate inter-rater reliability
The method to calculate inter-rater reliability depends on the type of data and the number of raters. The goal is to determine how consistently different observers assess the same phenomenon. Using the right formula ensures your measurements are reliable, reproducible, and credible.
The four most common methods are:
- Cohen’s kappa (two raters, categorical data)
- Fleiss’ kappa (three or more raters, categorical data)
- Intraclass Correlation Coefficient (continuous data)
- Percent agreement (simple, but limited, method)
Cohen’s kappa
Cohen’s kappa (κ) is used when two raters classify items into categories (nominal data or ordinal data). It accounts for the agreement expected by chance, providing a more accurate measure than simple percent agreement.
The formula for Cohen’s kappa is:

κ = (Po − Pe) / (1 − Pe)

Where:

- Po = observed proportion of agreement between raters
- Pe = expected proportion of agreement by chance
Suppose two raters independently classify 50 individuals as either anxious or not anxious. The results look like this:
| | Rater 2: Anxious | Rater 2: Not anxious | Total |
|---|---|---|---|
| Rater 1: Anxious | 20 | 5 | 25 |
| Rater 1: Not anxious | 10 | 15 | 25 |
| Total | 30 | 20 | 50 |
Step 1: Calculate the observed agreement (Po)

The raters agree on 20 “anxious” and 15 “not anxious” cases, so:

Po = (20 + 15) / 50 = 0.70

Step 2: Calculate the expected agreement (Pe)

Using the marginal totals, the probability that both raters say “anxious” by chance is (25/50) × (30/50) = 0.30, and the probability that both say “not anxious” by chance is (25/50) × (20/50) = 0.20, so:

Pe = 0.30 + 0.20 = 0.50

Step 3: Apply Cohen’s kappa formula

κ = (0.70 − 0.50) / (1 − 0.50) = 0.40

A kappa of 0.40 indicates only fair to moderate agreement between the two observers. To obtain more robust results, the researchers could provide rater training or clearer scoring guidelines before conducting another study.
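If you want to verify the arithmetic, here is a minimal Python sketch (not part of the original example) that reproduces the calculation directly from the 2×2 table above:

```python
# Counts from the table above: rows = Rater 1, columns = Rater 2
#                  Anxious  Not anxious
table = [[20, 5],   # Rater 1: Anxious
         [10, 15]]  # Rater 1: Not anxious

total = sum(sum(row) for row in table)        # 50 cases in total

# Observed agreement: both raters say "anxious" or both say "not anxious"
po = (table[0][0] + table[1][1]) / total      # (20 + 15) / 50 = 0.70

# Expected agreement by chance, from the marginal proportions
p1_anxious = sum(table[0]) / total                  # 25 / 50
p1_not = sum(table[1]) / total                      # 25 / 50
p2_anxious = (table[0][0] + table[1][0]) / total    # 30 / 50
p2_not = (table[0][1] + table[1][1]) / total        # 20 / 50
pe = p1_anxious * p2_anxious + p1_not * p2_not      # 0.30 + 0.20 = 0.50

kappa = (po - pe) / (1 - pe)
print(round(kappa, 2))  # 0.4, i.e., κ = 0.40
```

If you have the raw ratings for each case rather than a summary table, scikit-learn’s cohen_kappa_score function computes the same statistic from two lists of labels.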
Fleiss’ kappa
When three or more raters are involved for categorical data, Fleiss’ kappa extends Cohen’s approach to handle multiple observers.
For example, suppose researchers want to test how consistently raters classify skin lesions from photographs. Five raters (medical students) each review 20 skin lesion photographs and classify each one as either:
- Likely benign
- Possibly malignant
- Clearly malignant
Since there are more than two raters, the researchers use Fleiss’ kappa to measure agreement across the multiple observers.
A simplified version of the data might look like this (the data for 5 out of the 20 images is shown here):
| Image | Likely benign | Possibly malignant | Clearly malignant | Total raters |
|---|---|---|---|---|
| 1 | 3 | 2 | 0 | 5 |
| 2 | 0 | 4 | 1 | 5 |
| 3 | 1 | 1 | 3 | 5 |
| 4 | 5 | 0 | 0 | 5 |
| 5 | 2 | 2 | 1 | 5 |
Step 1: Calculate agreement for each item
For each image, Fleiss’ kappa looks at how many pairs of raters agreed on the same category. The formula for the proportion of agreement for each image (Pi) is:

Pi = Σj nij (nij − 1) / [n (n − 1)]

Where:

- n = number of raters
- nij = number of raters who chose category j for image i

For example, for image 1 (3 “likely benign,” 2 “possibly malignant,” 0 “clearly malignant”):

P1 = [3(2) + 2(1) + 0] / [5(4)] = 8 / 20 = 0.40

Each image gets its own Pi, and then you average them to get P, the mean observed agreement across all items.
Step 2: Calculate the expected agreement
Next, compute how often each category is chosen overall (across all raters and items). For example, across all 20 images:
- Likely benign: 40% of all ratings
- Possibly malignant: 35% of all ratings
- Clearly malignant: 25% of all ratings
Then calculate the expected agreement (Pe) as the sum of the squared category proportions:

Pe = 0.40² + 0.35² + 0.25² = 0.16 + 0.1225 + 0.0625 = 0.345
Step 3: Apply Fleiss’ kappa formula
The standard formula to calculate Fleiss’ kappa is:

κ = (P − Pe) / (1 − Pe)

If the average observed agreement (P) across all items is 0.60:

κ = (0.60 − 0.345) / (1 − 0.345) = 0.255 / 0.655 ≈ 0.39

A κ of 0.39 indicates fair agreement among the five medical students.
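For readers who prefer to compute this programmatically, the sketch below is a generic Fleiss’ kappa function in Python written for this article (not taken from a specific package). It is applied only to the five example images shown in the table, so its result (roughly 0.21) differs from the article’s κ of 0.39, which is based on all 20 images:

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa for an (items x categories) table of rating counts.
    Every row must sum to the same number of raters."""
    counts = np.asarray(counts, dtype=float)
    n = counts[0].sum()  # raters per item (5 in this example)

    # Per-item agreement: Pi = sum_j nij(nij - 1) / [n(n - 1)]
    p_i = (counts * (counts - 1)).sum(axis=1) / (n * (n - 1))
    p_bar = p_i.mean()  # mean observed agreement across items

    # Expected agreement: sum of squared overall category proportions
    p_j = counts.sum(axis=0) / counts.sum()
    p_e = (p_j ** 2).sum()

    return (p_bar - p_e) / (1 - p_e)

# The five images from the table: [likely benign, possibly malignant, clearly malignant]
sample = [[3, 2, 0],
          [0, 4, 1],
          [1, 1, 3],
          [5, 0, 0],
          [2, 2, 1]]
print(round(fleiss_kappa(sample), 2))  # ~0.21 for this 5-image subset
```

The statsmodels package also provides a fleiss_kappa function (in statsmodels.stats.inter_rater) that accepts the same kind of item-by-category count table.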
Intraclass Correlation Coefficient (ICC)
The Intraclass Correlation Coefficient (ICC) is used when inter-rater (or intra-rater) reliability is calculated for continuous data.
The ICC quantifies how much of the total variance in the measurements is due to true differences between subjects versus differences caused by raters or random error.
- A high ICC (close to 1) means most of the variation reflects real differences between patients.
- A low ICC means raters are inconsistent, so much of the variation is due to measurement error.
Because the data is continuous (measured on an interval or ratio scale) and multiple raters assess the same subjects, the researchers use the ICC to measure inter-rater reliability.
There are several ICC models, depending on whether raters are treated as fixed or random effects and whether single or average ratings are reported. A commonly used form, which treats the raters as randomly selected, is:

ICC = (MSB − MSW) / [MSB + (k − 1) × MSW]
Where:
- MSB = mean square between subjects (variation among subjects)
- MSW = mean square within subjects (error + rater variance)
- k = number of raters
These values come from an ANOVA table for the ratings.
Sample calculation
Suppose 10 patients are each rated by 3 physiotherapists on knee flexion (in degrees). After conducting the ANOVA, the researchers find:
- MSB = 1200
- MSW = 100
- k = 3
Plugging these values into the formula:

ICC = (1200 − 100) / [1200 + (3 − 1) × 100] = 1100 / 1400 ≈ 0.79

The ICC of 0.79 indicates good agreement among the physiotherapists.
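As a quick check, the Python sketch below plugs the example’s ANOVA results into the same mean-square formula; icc_from_mean_squares is a helper name chosen for this illustration, not a standard library function:

```python
def icc_from_mean_squares(msb, msw, k):
    """ICC from the ANOVA mean squares, matching the formula above:
    ICC = (MSB - MSW) / (MSB + (k - 1) * MSW)."""
    return (msb - msw) / (msb + (k - 1) * msw)

# Values from the knee-flexion example: MSB = 1200, MSW = 100, k = 3 raters
print(round(icc_from_mean_squares(1200, 100, 3), 2))  # 0.79
```

In practice, packages such as pingouin can compute several ICC variants directly from the raw ratings, which saves you from running the ANOVA by hand.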
Percent agreement
The percent agreement method is the simplest way to assess inter-rater reliability. It measures the proportion of times that raters agree on a category or rating, without accounting for chance agreement like kappa does. This makes it less suitable for scientific research, but it is useful for preliminary checks or simple observational studies.
For example, suppose two raters independently judge 30 responses as correct or incorrect. A simplified table of agreements might look like this:
| | Rater 2: Correct | Rater 2: Incorrect | Total |
|---|---|---|---|
| Rater 1: Correct | 15 | 3 | 18 |
| Rater 1: Incorrect | 2 | 10 | 12 |
| Total | 17 | 13 | 30 |
The formula to calculate percent agreement is:

Percent agreement = (Number of agreements / Total number of ratings) × 100

The number of agreements (cases where both raters chose “correct” or both chose “incorrect”) is 15 + 10 = 25, so:

Percent agreement = (25 / 30) × 100 ≈ 83.3%

This means that in 83.3% of the cases, the observers gave the same rating.
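Because percent agreement is simply a count of matches, it is easy to script; the following minimal Python sketch reproduces the numbers from the table above:

```python
# Counts from the table above: rows = Rater 1, columns = Rater 2
table = [[15, 3],   # Rater 1: Correct
         [2, 10]]   # Rater 1: Incorrect

agreements = table[0][0] + table[1][1]    # both "correct" or both "incorrect": 15 + 10 = 25
total = sum(sum(row) for row in table)    # 30 cases in total

percent_agreement = agreements / total * 100
print(round(percent_agreement, 1))  # 83.3
```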
Frequently asked questions about inter-rater reliability
- What is the formula for calculating inter-rater reliability?
There isn’t just one formula for calculating inter-rater reliability. The right one depends on your data type (e.g., nominal data, ordinal data) and the number of raters.
- Cohen’s kappa (κ) is commonly used for two raters
- Fleiss’ kappa is typically used for three or more raters
- The Intraclass Correlation Coefficient (ICC) is used for continuous data (interval or ratio). This is based on analysis of variance (ANOVA)
The most commonly used formula (for Cohen’s kappa) is:

κ = (Po − Pe) / (1 − Pe)

Po is the observed proportion of agreement, and Pe stands for the expected agreement by chance.
- What is inter-rater reliability in psychology?
In psychology, inter-rater reliability refers to the degree of agreement between different observers or raters who evaluate the same behavior, test, or phenomenon.
It ensures that measurements are consistent, objective, and not dependent on a single person’s judgment, which is especially important in research, clinical assessments, and behavioral studies.
High inter-rater reliability indicates that results are dependable and reproducible across different raters.
- What is a good inter-rater reliability score?
A good inter-rater reliability score depends on the statistic used and the context of the study.
For Cohen’s kappa (two raters), common guidelines are:
- < 0.20: Poor agreement
- 0.21–0.40: Fair agreement
- 0.41–0.60: Moderate agreement
- 0.61–0.80: Substantial agreement
- 0.81–1.00: Almost perfect agreement
For the Intraclass Correlation Coefficient (interval or ratio data), similar thresholds are used:
- < 0.50: Poor agreement
- 0.51–0.75: Moderate agreement
- 0.76–0.90: Good agreement
- > 0.91: Excellent agreement