One recent problem the field of psychology has been facing is the replication crisis: when researchers redo a psychological study using the same procedures as the original, they often fail to obtain the same (significant) results. When these findings first came to light, some began to question the validity of psychology as a science because of its lack of reproducible results. Here is a quick overview of why this crisis has occurred.
A major goal of psychological research is to have the work published in a journal, and most journals will only publish research whose findings are statistically significant. The tendency for non-significant results to go unpublished is known as the file drawer effect, since those findings are often put into a file drawer and never see print (PLOS ONE). Why would a journal want to publish a study reporting that the researchers found no relationship between two variables? Unfortunately, the file drawer effect reduces the number of studies that reach the public. Non-significant findings often surface only when researchers conduct a meta-analysis. In a meta-analysis, psychologists review all previous literature and research on their topic of interest, often asking fellow psychologists whether they have unpublished data to include. If several researchers have replicated a study and obtained non-significant results, and the overall trend is non-significant, this is reported in the meta-analysis, making it clear that the study cannot be replicated.
In order to understand the replication crisis, we need to understand the statistical methods behind psychological studies. Researchers start by asking a question and then form a hypothesis. For example, researchers might ask whether behaviors are acquired through imitation and observation, or whether they are genetic. A hypothesis might be that children who observe an adult model behaving aggressively towards a toy (a Bobo doll) will behave more aggressively when later observed playing with toys, compared to control groups of children who were not exposed to adults modeling aggressive behavior (Simply Psychology). This is known as the alternative hypothesis, the claim the researchers seek to support. The null hypothesis assumes there is no association or effect of the sort the researchers propose (e.g., children who observe an adult model behaving aggressively towards a Bobo doll will not behave more aggressively when later observed playing with toys than control children who were not exposed to aggressive models). In this kind of research, the goal is to reject the null hypothesis. Researchers start by assuming that the null hypothesis describes the true association or effect in the population, and their experiment must show that the data are sufficiently inconsistent with the null hypothesis to reject it.
How do researchers reject the null hypothesis and show that their results are significant? When psychological studies report a p-value of less than 0.05, the result is considered statistically significant. For those who have not studied statistics, a p-value is the probability of obtaining a sample statistic at least as extreme as the one observed, assuming that the null hypothesis is true. The smaller the p-value, the less compatible the data are with the null hypothesis, meaning it becomes less plausible that there is no relationship or effect between the variables the researchers are examining. With a small enough p-value, the researchers can reject the null hypothesis in favor of the alternative hypothesis.
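To make this concrete, here is a minimal sketch of a null-hypothesis test in Python. The data and the control-group mean are entirely made up for illustration; the only assumption is that the SciPy library is available.

```python
# Hypothetical aggression scores for children exposed to an aggressive model,
# tested against a control-group mean of 10 (all numbers are invented).
import numpy as np
from scipy import stats

exposed_scores = np.array([12, 15, 11, 14, 13, 16, 12, 15])  # hypothetical sample
control_mean = 10  # value claimed by the null hypothesis

# Null hypothesis: the exposed group's true mean equals the control mean.
# The p-value is the probability of a sample mean at least this far from the
# null value, if the null hypothesis were true.
t_stat, p_value = stats.ttest_1samp(exposed_scores, popmean=control_mean)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

if p_value < 0.05:
    print("Reject the null hypothesis at the 0.05 level.")
else:
    print("Fail to reject the null hypothesis.")
```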
In order to resolve the replication crisis, psychologists have introduced a number of reforms. Most of these reforms involve transparency in the planning and analysis of studies, but the scientific community has not yet reached a consensus on which methods will do the most to increase the validity of findings (Psychology Today). One method for ensuring greater credibility is hypothesis preregistration: before collecting any data, researchers publicly report what they expect their results to be. This prevents researchers from simply changing their hypothesis if the data suggest an effect opposite to their original expectation. Another method is Registered Reports, in which journals agree to publish studies whose plans and methodologies are transparent and sound, without taking into account whether the findings turn out to be significant.
In addition to applying these reforms, psychologists and research consumers need to pay attention to other factors that affect the statistical validity of results. Outliers can skew the data and lead researchers either to wrongly conclude that there is a relationship between two variables or to wrongly conclude that there is none. Small samples are especially vulnerable to outliers, while larger samples are more likely to be representative of the population.
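The following sketch illustrates the point with simulated numbers: two unrelated variables show essentially no correlation until a single extreme outlier is added to a small sample. All values are made up purely for illustration.

```python
# How one outlier can distort a correlation in a small sample (simulated data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=0, scale=1, size=10)   # small sample, no true relationship
y = rng.normal(loc=0, scale=1, size=10)   # unrelated variable

r_clean, p_clean = stats.pearsonr(x, y)

# Replace one point with an extreme outlier on both variables.
x_out, y_out = x.copy(), y.copy()
x_out[0], y_out[0] = 8, 8
r_outlier, p_outlier = stats.pearsonr(x_out, y_out)

print(f"Without outlier: r = {r_clean:.2f}, p = {p_clean:.3f}")
print(f"With outlier:    r = {r_outlier:.2f}, p = {p_outlier:.3f}")
```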
It is possible to design a study so that it achieves statistical significance even when the effect in the population is negligible. Even in studies with a very small effect size, a large enough sample can yield statistical significance. For example, the one-sample t-statistic is t = (x̄ − μ) / (s / √n), where x̄ is the sample mean, μ is the population mean under the null hypothesis (assumed here to be 0), s is the sample standard deviation, and n is the sample size. Holding the other quantities fixed, increasing the sample size produces a larger t-value, which makes it more likely that the result will be statistically significant. For this reason, it has become increasingly common for researchers to publish the effect size in addition to the p-value (NIH). Commonly reported effect sizes include Cohen's d and Pearson's r correlation.
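The sketch below simulates this: the true effect is tiny and Cohen's d stays small at every sample size, yet the p-value drops below 0.05 once the sample is large enough. The means, standard deviation, and sample sizes are arbitrary choices for illustration.

```python
# A tiny effect becomes "significant" once n is large; the effect size stays small.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_mean, null_mean, sd = 0.05, 0.0, 1.0   # very small true effect

for n in (25, 400, 10_000):
    sample = rng.normal(loc=true_mean, scale=sd, size=n)
    t_stat, p_value = stats.ttest_1samp(sample, popmean=null_mean)
    cohens_d = (sample.mean() - null_mean) / sample.std(ddof=1)
    print(f"n = {n:>6}: t = {t_stat:5.2f}, p = {p_value:.4f}, d = {cohens_d:.3f}")
```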
It is also important to examine the methodology of psychological studies, typically by assessing construct validity. Researchers should look at each variable they measure and ask: How well is the variable measured? Are you measuring what you intend to measure? For self-report scales, does your measurement show proper convergent and discriminant validity? For behavioral outcomes, does it show proper criterion validity? Is your measurement reliable? The higher the construct validity, the more likely the results can be replicated in the future.
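As one small example of such a check, here is a sketch of estimating convergent validity as the correlation between two scales meant to measure the same construct. The scale names and scores are hypothetical.

```python
# Convergent validity estimated as the correlation between two hypothetical
# aggression scales administered to the same ten participants.
import numpy as np
from scipy import stats

aggression_scale_a = np.array([3, 5, 4, 6, 2, 5, 4, 3, 6, 5])
aggression_scale_b = np.array([2, 5, 4, 5, 3, 6, 4, 2, 6, 4])

r, p = stats.pearsonr(aggression_scale_a, aggression_scale_b)
print(f"Convergent validity estimate: r = {r:.2f} (p = {p:.3f})")
# A high correlation supports convergent validity; a near-zero correlation with an
# unrelated construct would support discriminant validity.
```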
Overall, the replication crisis is a problem for psychological researchers. As a whole, the field needs to move away from its emphasis on significant results and towards an emphasis on sound methodology. Once that happens, statistically significant findings will be much more likely to replicate.