What is Criterion Validity? | Definition & Examples

Criterion validity (or criterion-related validity) is an assessment of how accurately a test or instrument captures the outcome it was designed to measure. These outcomes are generally constructs that cannot be directly measured, such as intelligence or happiness. Such constructs occur frequently in psychology research.

Criterion validity is determined by comparing your test results to a “gold standard,” or criterion, that acts as a ground truth. If your test and the criterion are measuring the same construct, their results should be highly correlated, which indicates that your test has high criterion validity.

There are two types of criterion validity, which differ in their timelines of comparison. Concurrent validity compares two measures administered at the same time, whereas predictive validity captures how one measure correlates with a second measure taken in the future.

Criterion validity example
A technology company has created a watch that uses biometric data to estimate users’ stress levels. However, they want to ensure that their estimate of stress has criterion validity. This can be done by measuring either concurrent or predictive validity.

  • Concurrent validity could be assessed by comparing the watch’s real-time estimate of stress levels to a validated measure of stress administered at the same time, such as the Perceived Stress Scale.
  • Predictive validity could be determined by examining whether initial watch estimates of stress correlate with future health outcomes associated with chronic stress, such as hypertension.

If the watch estimates are highly correlated with the Perceived Stress Scale or future measures of hypertension, the company has successfully established criterion validity.

What is criterion validity?

Criterion validity is used to measure how well a test corresponds to a well-established measure called a criterion. If the test and criterion measure the same phenomenon, their results should be closely correlated.

In psychology and other social sciences, many phenomena of interest are constructs that cannot be directly measured. Researchers must instead create tests or measures that capture behaviors or performance related to these constructs.

To validate how well such a test captures the construct it’s supposed to measure, it should be compared to a “ground truth” measurement. But how do you determine the ground truth of something that can’t be directly measured?

Criterion validity offers a way around this issue. A new test designed to measure a construct can be validated through comparison to an existing “gold standard,” or criterion, that acts as the ground truth. Criterion variables could include any of the following:

  • A clinical assessment
  • A well-established questionnaire
  • A measure of some related behavior

Most importantly, the criterion should measure the same or a similar construct as the test being validated.

If a clear criterion or gold standard exists, assessing criterion validity is relatively straightforward: you simply obtain your test and criterion measures and compute the correlation between them. However, if no gold standard exists, criterion validity cannot be established.

Criterion validity is also only as good as the criterion used as a benchmark. If there are biases or errors in the criterion, high criterion validity merely indicates that the measure being validated has the same problems.

Types of criterion validity

The two types of criterion validity are concurrent validity and predictive validity. Both involve comparing the results from one test to a second target outcome or criterion. However, these types of validity differ in when the test and criterion measurements are taken:

  • To assess concurrent validity, the measure of interest should be compared to a widely accepted measure obtained at the same time.
  • In contrast, predictive validity is determined by comparing a measure to an outcome that occurs later in time.

Though both concurrent and predictive validity are measures of criterion validity, they provide slightly different information about a measure.

Concurrent validity is most helpful when you are evaluating a new instrument or test that offers improvements (in cost, ease of use, etc.) over an existing approach. By obtaining both measurements at the same time and comparing them, you can determine if your new measure captures the same construct as the existing criterion.

Predictive validity is useful in determining whether an instrument is correlated with some future criterion. This can be helpful if you would like to know if your instrument can be used to predict an outcome that occurs later in time.

Concurrent vs predictive validity example
A pet adoption agency has developed a personality test to help match animals with the best new homes. The type of criterion validity they use to assess this new instrument will depend on what they would like to know about their new test.

If they would like to determine whether this test provides an accurate assessment of personality compared to the industry standard, the agency would determine concurrent validity. They would do so by conducting their test and the existing test simultaneously and calculating the correlation between the two.

If they would like to know if their test is predictive of a future outcome, like an animal’s stress levels in their new home, the agency would assess predictive validity. They could administer their personality test, wait one year, and then determine if the original results were correlated with a measure of the animal’s stress levels in their new home.

Predictive validity may be considered more informative than concurrent validity, as it can help determine if a test provides information about a future, real-world outcome. However, predictive validity takes much longer to assess than concurrent validity, as there is a necessary delay between administering the test and measuring the criterion.

Criterion validity example

Consider the following example of why and how a psychology researcher might measure criterion validity.

A social psychologist is interested in how the happiness of individuals relates to relationship longevity. They have developed an engaging online questionnaire to measure happiness.

The psychologist would like to (1) confirm that their questionnaire can be used instead of existing measures and (2) determine if its results predict the duration of new relationships. Both questions can be addressed by measuring different types of criterion validity:

1. To assess whether the questionnaire measures happiness, the psychologist can determine its concurrent validity. Their criterion in this case could be the Subjective Happiness Scale, a validated measure of happiness. They would administer their questionnaire to a group of participants alongside this criterion.

If the scores of both instruments are highly correlated, the psychologist can conclude that their questionnaire has high concurrent validity. In other words, it appears to capture the same construct as the Subjective Happiness Scale.

2. To determine if their questionnaire predicts relationship longevity, the psychologist can instead measure predictive validity. They could use relationship duration as their criterion. Newly dating couples would fill out the questionnaire, then complete a follow-up three years later to see how long their relationship lasted.

If the questionnaire results correlate with relationship duration, the psychologist has established the predictive validity of their questionnaire.

How to measure criterion validity

Criterion validity is determined by comparing a test with an outcome (the criterion) that measures the same or a similar construct. This outcome may be measured at the same time as the test to be validated (in the case of concurrent validity) or at a later point in time (to determine predictive validity).

The test and outcome are compared by determining the correlation between them. The strength of the correlation between two variables is quantified by computing a correlation coefficient, such as Pearson’s r.
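
For reference, the standard formula for Pearson’s r is given below. The symbols are notation introduced here for illustration, not taken from the article: x_i is participant i’s score on the test being validated, y_i is the same participant’s criterion score, n is the number of participants, and the barred symbols are the sample means.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

Values of r range from -1 to 1. A value close to 1 means that participants who score high on the test also tend to score high on the criterion.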

If the test and the criterion provide similar measurements, they will be highly correlated. You can therefore conclude that your test has criterion validity.
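
The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the function name `pearson_r`, the variable names, and all of the scores are invented for the example. It assumes each participant has one score on the new test, one criterion score collected at the same time (for concurrent validity), and one criterion score collected later (for predictive validity).

```python
from math import sqrt


def pearson_r(x, y):
    """Return Pearson's correlation coefficient for two paired score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Deviations of each score from its own mean.
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    sum_xx = sum(a * a for a in dx)
    sum_yy = sum(b * b for b in dy)
    if sum_xx == 0 or sum_yy == 0:
        raise ValueError("correlation is undefined when all scores are identical")
    return sum_xy / sqrt(sum_xx * sum_yy)


# Hypothetical scores for eight participants (invented for illustration).
new_test = [12, 15, 9, 20, 17, 11, 14, 18]
criterion_same_time = [30, 36, 25, 47, 41, 28, 33, 44]  # measured alongside the test
criterion_later = [4, 6, 3, 9, 5, 5, 4, 8]              # measured at a follow-up

concurrent_r = pearson_r(new_test, criterion_same_time)
predictive_r = pearson_r(new_test, criterion_later)

print(f"Concurrent validity: r = {concurrent_r:.2f}")
print(f"Predictive validity: r = {predictive_r:.2f}")
```

In practice, a statistics package would usually be used instead of a hand-written function, and the size of correlation that counts as “high” depends on the field and the criterion chosen.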

Construct vs criterion validity

Construct validity captures how well a test or instrument measures the phenomenon it’s supposed to measure. Criterion validity assesses how well a test measures a construct by comparing it to some benchmark measured at the same time (concurrent validity) or in the future (predictive validity).

Construct validity is generally considered the primary focus when validating a new test. Other measures of validity—criterion validity, content validity, and face validity—provide evidence of construct validity.

Frequently asked questions about criterion validity

What are the two types of criterion validity?

Criterion validity measures how well a test corresponds to another measure, or criterion. The two types of criterion validity are concurrent and predictive validity.

What is a construct?

A construct is a phenomenon that cannot be directly measured, such as intelligence, anxiety, or happiness. Researchers must instead approximate constructs using related, measurable variables.

The process of defining how a construct will be measured is called operationalization. Constructs are common in psychology and other social sciences.

To evaluate how well a test measures the construct it’s supposed to, researchers determine construct validity. Face validity, content validity, criterion validity, convergent validity, and discriminant validity all provide evidence of construct validity.

What is the difference between content and criterion validity?

Content validity and criterion validity are two types of validity in research:

  • Content validity ensures that an instrument measures all elements of the construct it intends to measure.
    • A survey to investigate depression has high content validity if its questions cover all relevant aspects of the construct “depression.”
  • Criterion validity ensures that an instrument corresponds with other “gold standard” measures of the same construct.
    • A shortened version of an established anxiety assessment instrument has high criterion validity if the outcomes of the new version are similar to those of the original version.
Emily Heffernan, PhD

Emily has a bachelor's degree in electrical engineering, a master's degree in psychology, and a PhD in computational neuroscience. Her areas of expertise include data analysis and research methods.