Knowing whether your data are from a population or a sample is key to properly analyzing or interpreting your results.
If your data are from a subset of the group you are studying, your data represent a sample. If instead your data have been collected from every single individual you are interested in studying, your data are from a population.
If you are analyzing data you did not collect yourself, consider how likely it is that the researchers who collected this data gathered measurements from every single individual they were interested in studying.
Researchers generally collect data from a smaller group and use the results to make inferences about the population, so there’s a good chance that these data are from a sample rather than a population.
Continue reading: How do I know if my data are from a population or a sample?
In statistics, population parameters are characteristics that describe a population (such as mean, standard deviation, and variance). They are calculated using the data from every member of the group you want to learn about, so they provide a completely accurate description of that population.
Sample statistics, on the other hand, are calculated from a sample (a subset of the population). Sample statistics provide an estimate of population parameters, but because they do not include data from every member of the population, they may be biased or inaccurate.
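As a minimal sketch (using made-up exam scores), the same numbers can yield a population parameter or a sample statistic depending on whether they cover the whole group; note the different divisors used for variance:

```python
import statistics

scores = [72, 85, 90, 66, 78, 81, 95, 70]  # hypothetical exam scores

# If these scores cover every member of the group of interest (a population):
population_mean = statistics.mean(scores)
population_variance = statistics.pvariance(scores)  # divides by N

# If these scores are only a subset of the group (a sample):
sample_mean = statistics.mean(scores)            # the mean is computed the same way
sample_variance = statistics.variance(scores)    # divides by N - 1 (Bessel's correction)

print(population_variance, sample_variance)
```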
Continue reading: What are sample statistics vs population parameters?
Sampling bias occurs when some individuals in the population are more likely to be included in a sample than others. This can limit how well results generalize to the broader population.
Sampling methods like probability sampling help reduce sampling bias because every individual in the population has a known, non-zero chance of being included in the sample. However, it’s difficult to eliminate sampling bias entirely, so results from a sample should always be interpreted with caution.
Continue reading: What is sampling bias?
Simple random sampling is a probability sampling method. Individuals are selected randomly from a list of all members of the population (the sampling frame), so everyone has an equal chance of being included in the sample.
This method is less prone to sampling bias than many other sampling methods, but it can be more difficult to conduct. It requires a complete list of the population and does not account for how easy or difficult it is to reach the selected individuals.
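As a minimal sketch (assuming a hypothetical list of customer IDs as the sampling frame), simple random sampling can be done with Python's standard library:

```python
import random

# Hypothetical sampling frame: a complete list of 1,000 customer IDs.
sampling_frame = [f"customer_{i}" for i in range(1, 1001)]

# Simple random sample of 50 individuals, drawn without replacement;
# every member of the frame has an equal chance of being selected.
sample = random.sample(sampling_frame, k=50)
```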
Continue reading: Is simple random sampling probability or nonprobability sampling?
Sampling is the process of selecting a subset of individuals (a sample) from a larger population.
Because it’s often not feasible to collect data from every individual in a population, researchers study a sample instead. The goal is to use this sample to make predictions (or inferences) about the broader population.
For example, if you want to study consumer attitudes towards a brand, you might survey a subset of customers rather than every single one.
There are different sampling methods that can be used to select a sample.
Continue reading: What is sampling?
Random sampling (also called probability sampling) is a category of sampling methods used to select a subgroup, or sample, from a larger population. A defining property of random sampling is that all individuals in the population have a known, non-zero chance of being included in the sample.
Random sampling methods include simple random sampling, systematic sampling, stratified sampling, and cluster sampling. All of these methods require a sampling frame (a list of all individuals in the population).
The opposite of random or probability sampling is non-probability sampling, where not every member of the population has a known chance of being included in the sample.
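As a rough sketch (using a hypothetical frame of 100 people split into two regions), here is how two of these methods, systematic and stratified sampling, might look in Python:

```python
import random

# Hypothetical sampling frame of 100 people, each tagged with a stratum (region).
frame = [(f"person_{i}", "north" if i <= 60 else "south") for i in range(1, 101)]

# Systematic sampling: pick a random start, then take every k-th person in the frame.
k = 10
start = random.randrange(k)
systematic_sample = frame[start::k]

# Stratified sampling: draw a simple random sample within each stratum.
stratified_sample = []
for stratum in ("north", "south"):
    members = [person for person in frame if person[1] == stratum]
    stratified_sample.extend(random.sample(members, k=5))
```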
Continue reading: What are random sampling methods?
Correlation measures the strength and direction of the relationship between two variables.
Regression goes a step further: it lets you model the relationship between a dependent variable and one or more independent variables, often using a line of best fit that lets you make predictions about your data.
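A minimal sketch of this difference in Python (with made-up study-time data, using SciPy):

```python
from scipy import stats

# Hypothetical data: hours studied vs. exam score.
hours = [1, 2, 3, 4, 5, 6, 7, 8]
score = [52, 55, 61, 60, 68, 70, 75, 79]

# Correlation: strength and direction of the relationship (r between -1 and +1).
r, p_value = stats.pearsonr(hours, score)

# Regression: a line of best fit you can use for prediction.
fit = stats.linregress(hours, score)
predicted = fit.intercept + fit.slope * 9  # predicted score for 9 hours of study
```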
Continue reading: What is regression?
Pearson’s r (the “Pearson product–moment correlation coefficient,” or simply “r”) is the most common measure of correlation between two variables. It tells you how strongly, and in what direction, the two variables are linearly related. Most statistical tools (like R or Excel) have a built-in correlation function.
The value of r ranges from -1 to +1. The sign of r (+ or –) indicates the direction of a relationship (whether a correlation is positive or negative), and the magnitude of r indicates the strength of the relationship (sometimes called the effect size).
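For example, a minimal sketch of computing r in Python with NumPy (on made-up coffee-and-productivity numbers):

```python
import numpy as np

# Hypothetical data: daily cups of coffee vs. self-rated productivity.
coffee = np.array([1, 2, 2, 3, 4, 4, 5, 6])
productivity = np.array([4, 5, 5, 6, 7, 6, 8, 9])

# np.corrcoef returns a 2x2 correlation matrix; r is an off-diagonal entry.
r = np.corrcoef(coffee, productivity)[0, 1]
print(round(r, 2))  # falls somewhere between -1 and +1
```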
What is considered a strong, moderate, or weak correlation varies by field. Many researchers use Cohen’s size criteria as a guideline:
Cohen’s size criteria

| r value               | Direction | Strength       |
|-----------------------|-----------|----------------|
| Between –1 and –0.5   | Negative  | Strong         |
| Between –0.5 and –0.3 | Negative  | Moderate       |
| Between –0.3 and –0.2 | Negative  | Weak           |
| Between –0.2 and +0.2 | N/A       | No correlation |
| Between +0.2 and +0.3 | Positive  | Weak           |
| Between +0.3 and +0.5 | Positive  | Moderate       |
| Between +0.5 and +1   | Positive  | Strong         |
Continue reading: What is Pearson’s r?
A correlation is a relationship between two variables: as one changes, the other tends to change as well. For example, coffee consumption is correlated with productivity: people who drink more coffee often report getting more done.
Causation, on the other hand, means that a change in one variable directly causes changes in another. To test whether coffee actually increases productivity, you could conduct an experiment: assign some people to drink coffee and others to drink water, and compare their task performance.
It’s important to remember that correlation does not imply causation. Even if coffee consumption and productivity are correlated, it doesn’t mean one causes the other. It’s possible that people who are working more are tired, so they drink more coffee.
Continue reading: What are examples of correlation vs causation?
Causation and correlation are two ways variables can be related.
Causation means changes in one variable directly lead to changes in another (i.e., there is a cause-and-effect relationship). For example, eating food (the cause) satisfies hunger (the effect).
Correlation means there is a statistical relationship between two variables—as one changes, so does the other. However, this relationship is not necessarily causal. For example, although a child’s shoe size and their reading ability are correlated, one does not cause the other (instead, they’re both influenced by a third variable, age).
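A minimal simulation sketch in Python (with invented numbers) of how a confounding variable like age can produce a correlation between shoe size and reading ability without any causal link:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 500 children: both shoe size and reading score depend on age.
age = rng.uniform(5, 12, size=500)
shoe_size = 25 + 1.5 * age + rng.normal(0, 1, size=500)
reading_score = 10 * age + rng.normal(0, 5, size=500)

# Shoe size and reading score come out strongly correlated...
r = np.corrcoef(shoe_size, reading_score)[0, 1]
print(round(r, 2))

# ...but only because both are driven by the confounding variable, age.
```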
Continue reading: What is the difference between correlation and causation?