
2024-12-18: Research Summary on Replicability Assessment of SBS Papers

 


According to a report published by the National Academies of Sciences, Engineering, and Medicine, "reproducibility refers to instances in which the original researcher's data and computer codes are used to regenerate the results, while replicability refers to instances in which a researcher collects new data to arrive at the same scientific findings as a previous study." In this blog, we will focus on several milestone papers on the replicability assessment of SBS (social and behavioral sciences) papers.


SBS includes psychology, economics, political science, sociology, etc. SBS papers typically use statistical testing to test hypotheses. A p-value below a significance level (typically 0.05) can be used as evidence to reject the null hypothesis. However, as pointed out by a previous study, a p-value provides only partial information about the probability of a tested hypothesis being true. Statistically significant results from small studies are more likely to be false positives than those from large studies. In addition, publication bias in favor of speculative findings makes this problem even more complicated.
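
To make this concrete, here is a minimal simulation, my own illustration rather than anything from the papers discussed, showing how, under an assumed prior probability that a tested effect is real, a significant result from a small study is more often a false positive than one from a large study. The prior, effect size, and sample sizes are all assumptions chosen for illustration.

```python
# Illustrative simulation (not from the cited papers): among significant results,
# what share come from studies where the null hypothesis was actually true?
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def false_positive_share(n_per_group, prior_true=0.1, effect=0.4,
                         n_sim=20000, alpha=0.05):
    """Fraction of significant two-sample t-tests that tested a null effect."""
    sig_true, sig_false = 0, 0
    for _ in range(n_sim):
        is_true = rng.random() < prior_true          # does a real effect exist?
        shift = effect if is_true else 0.0
        a = rng.normal(0.0, 1.0, n_per_group)
        b = rng.normal(shift, 1.0, n_per_group)
        if stats.ttest_ind(a, b).pvalue < alpha:
            if is_true:
                sig_true += 1
            else:
                sig_false += 1
    return sig_false / max(sig_true + sig_false, 1)

for n in (20, 200):
    print(f"n={n} per group: share of significant results that are "
          f"false positives = {false_positive_share(n):.2f}")
```

With these assumed numbers, the small-sample studies produce a noticeably larger share of false positives among their significant results, which is the pattern the paragraph above describes.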

It is very challenging to automatically and reliably assess the reproducibility of reported findings in general. The field has nevertheless made progress, moving from manual replication toward less time-consuming automated methods.


Manual Replication

The RPP project (Reproducibility Project: Psychology) is a well-known manual replication effort. It started at the end of 2011, coordinated by the Center for Open Science, and spent three years on data collection and one year analyzing the results. Two reports, published in 2012 and 2015, described the design of the project and analyzed its results.

The project involved 72 volunteers from 41 institutions, working on the replication of 100 papers from three top journals in psychology. Ninety-seven percent of the original studies had significant results (p < .05), while only thirty-six percent of the replications did. Thirty-nine percent of replications were subjectively rated as having replicated the original result. The mean effect size (a value measuring the strength of the relationship between two variables) of the replications (Mr = 0.197, SD = 0.257) was half the magnitude of that of the original studies (Mr = 0.403, SD = 0.188); here Mr denotes the mean effect size expressed as a correlation coefficient r, and SD the standard deviation.

Reproducibility was evaluated based on significance (p-values), effect sizes, subjective assessments of replication teams, and meta-analysis of effect sizes. In the final assessment, only 39 papers were reported as successfully replicated. 
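
As a rough illustration of how such outcomes can be summarized, the sketch below computes the kinds of quantities the RPP reports: the share of replications that are significant, mean effect sizes (as correlation coefficients r), and how often the replication effect is smaller than the original. The study values here are made up for illustration, not taken from the RPP data.

```python
# Hypothetical sketch of summarizing replication outcomes (fictional numbers).
import numpy as np

# (original_r, original_p, replication_r, replication_p) for a few fictional studies
studies = np.array([
    (0.45, 0.010, 0.20, 0.08),
    (0.38, 0.030, 0.35, 0.04),
    (0.52, 0.001, 0.05, 0.60),
    (0.30, 0.040, 0.28, 0.03),
])

orig_r, orig_p, rep_r, rep_p = studies.T
print("Replications significant at p < .05:", np.mean(rep_p < 0.05))
print("Mean original effect size (Mr):", round(float(orig_r.mean()), 3))
print("Mean replication effect size (Mr):", round(float(rep_r.mean()), 3))
print("Replication effect smaller than original:", np.mean(rep_r < orig_r))
```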


Prediction Markets

Rigorous replication of published studies comes at substantial economic cost and is thus rarely done. An alternative method to examine the replicability of SBS papers is expert-sourced prediction markets. The prediction markets estimated reproducibility in conjunction with the RPP project, covering 44 RPP studies before their replications were completed. About 50 researchers from the Open Science Framework and the RPP collaboration participated in each round. Participants were not allowed to trade on studies they were themselves replicating, in order to avoid bias in the prediction. The prediction market experiments were conducted twice, and each ran for two weeks.

In the prediction market, participants bet on whether each study's result would replicate by trading contracts that pay $1 if the study is successfully replicated and $0 otherwise. Such a contract reaches a price that can be interpreted as the predicted probability of successful replication: if the price is above 50%, the study is predicted to replicate. Before trading began, participants also completed a survey recording their subjective predictions.
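
The sketch below shows, with made-up numbers, how such final prices are read off: the price is treated as a replication probability, a price above 0.5 counts as a "will replicate" prediction, and accuracy can then be compared against the pre-trading survey.

```python
# Minimal sketch with hypothetical data: reading prediction market prices
# as replication probabilities and comparing accuracy against a survey.
final_prices = [0.82, 0.31, 0.55, 0.13, 0.74]     # final market price per study
survey_probs = [0.70, 0.45, 0.60, 0.25, 0.65]     # pre-trading survey estimates
replicated   = [True, False, False, False, True]  # actual replication outcomes

def accuracy(probabilities, outcomes, threshold=0.5):
    predictions = [p > threshold for p in probabilities]
    return sum(p == o for p, o in zip(predictions, outcomes)) / len(outcomes)

print("Market accuracy:", accuracy(final_prices, replicated))
print("Survey accuracy:", accuracy(survey_probs, replicated))
```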

The mean final market price was 55% (range: 13–88%), implying that about half of the 44 studies were expected to replicate. The prediction markets correctly predicted the outcome of 71% of the replications, while the survey correctly predicted 58%. As shown in Figure 1, the prediction markets clearly separated successfully replicated studies from failed ones.

The results indicated that the prediction market was effective. However, it requires substantial human participation and is hard to scale. Researchers have recently turned to automated alternatives, proposing synthetic prediction markets in which the market is approximated by a dynamical system and asset prices are determined by a logarithmic market scoring rule. This approach is promising, and we expect more progress in the future.
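
For readers unfamiliar with the logarithmic market scoring rule (LMSR) that such markets typically build on, here is a toy sketch. The pricing formulas are the standard LMSR ones; the agent behavior (each trader simply buys a fixed number of shares on the side its belief favors) is my own simplification, not the published synthetic-market algorithm.

```python
# Toy LMSR market for a binary outcome: "replicates" vs. "does not replicate".
import math

class LMSRMarket:
    def __init__(self, b=10.0):
        self.b = b                 # liquidity parameter
        self.q = [0.0, 0.0]        # outstanding shares for [yes, no]

    def cost(self, q):
        # LMSR cost function: C(q) = b * ln(sum_i exp(q_i / b))
        return self.b * math.log(sum(math.exp(x / self.b) for x in q))

    def price(self, outcome=0):
        # Instantaneous price = implied probability of the outcome
        z = sum(math.exp(x / self.b) for x in self.q)
        return math.exp(self.q[outcome] / self.b) / z

    def buy(self, outcome, shares):
        """Buy `shares` of `outcome`; return the amount the trader pays."""
        new_q = list(self.q)
        new_q[outcome] += shares
        payment = self.cost(new_q) - self.cost(self.q)
        self.q = new_q
        return payment

market = LMSRMarket(b=10.0)
# Toy agents with private beliefs about replication trade against the market.
for belief in (0.7, 0.6, 0.8, 0.75):
    side = 0 if belief > market.price(0) else 1
    market.buy(side, shares=5.0)
print("Final implied replication probability:", round(market.price(0), 3))
```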

Figure 1 (Figure 1 in Dreber et al. (2015)) : Prediction market performance. The prediction market predicts 29 out of 41 replications correctly, yielding better predictions than a survey carried out before the trading started. Successful replications (16 of 41 replications) are shown in black, and failed replications (25 of 41) are shown in red. Gray symbols are replications that remained unfinished (3 of 44).  





Machine Learning Methods

Recently, more cutting-edge techniques have been introduced by researchers with interdisciplinary backgrounds to assess reproducibility. 

Yang et al. (2020) proposed using deep learning methods to assess replicability. Ground-truth data is limited, since reliable manual replication projects are scarce and each is limited in size; this paper used the RPP manual replication results as training data. The model was built in two different ways: one using the narrative text (the full text stripped of all non-text content, such as graphics), and the other using the papers' reported statistics.

Next, in order to quantitatively represent a paper's narrative content, they used data from the Microsoft Academic Graph (MAG) to train a word2vec model on 2 million scientific abstracts published between 2000 and 2017. Each article could then be represented by a 200-dimensional vector capturing its linguistic information.
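
A hedged sketch of this kind of representation is shown below: train word2vec on a corpus of abstracts and average the word vectors appearing in a paper's text. The tiny corpus and the simple tokenization here are placeholders standing in for the MAG abstracts; the paper's exact preprocessing and aggregation may differ.

```python
# Sketch: represent a paper as the average of its word2vec embeddings.
import numpy as np
from gensim.models import Word2Vec

# Placeholder corpus standing in for ~2M MAG abstracts (list of tokenized texts).
corpus = [
    "we test whether priming affects decision making in a lab experiment".split(),
    "a randomized trial estimates the effect of incentives on survey response".split(),
    "replication of a classic social psychology finding with a larger sample".split(),
]

w2v = Word2Vec(sentences=corpus, vector_size=200, window=5,
               min_count=1, workers=1, seed=0)

def paper_vector(tokens, model):
    """Average the embeddings of in-vocabulary tokens (zeros if none found)."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

vec = paper_vector("priming affects decision making".split(), w2v)
print(vec.shape)   # (200,)
```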

In the next stage, they trained the model and predicted the replicability of papers in the test sets shown in Table 1. The experiments show that the model achieved an accuracy between 0.65 and 0.78, comparable to the results obtained with prediction markets.
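
The second stage can be sketched as an ordinary supervised-learning loop: fit a classifier on the paper vectors, with the manual-replication outcomes as labels, and score held-out papers. The data below is synthetic and the random-forest model and split sizes are illustrative assumptions, not the specific architecture used in the paper.

```python
# Sketch with synthetic data: classify 200-d paper vectors as replicable or not.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 200))      # stand-in 200-d paper vectors
y = rng.integers(0, 2, size=100)     # stand-in replicated / not-replicated labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("Out-of-sample accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```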

Table 1 (Table 1 in Yang et al. (2020)): Training and out-of-sample test datasets 



More recently, Youyou et al. (2023) extended the research described above. They used a machine learning model similar to that of Yang et al. (2020) to estimate the replication likelihood of 14,126 psychology articles published since 2000. The text representations are the same ones pre-trained on MAG paper abstracts; the training data differs slightly, as shown in Table 2.

Table 2 (Table 2 in Youyou et al. (2023)): Composition of training data



They predicted a replication score for each paper, which can be interpreted as the relative likelihood of replication success. The scores range from 0.10 to 0.86 (mean = 0.42, median = 0.41, SD = 0.15), as shown in Figure 2, consistent with the latest forecasts from prediction markets. They also examined the relationship between replicability and key features of a paper, finding a positive correlation between replication success and the authors' number of publications and citation impact.
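
The sketch below shows, with synthetic data, the shape of this kind of post-hoc analysis: summarize the distribution of predicted replication scores and check their correlation with a paper-level feature such as the authors' publication count. The numbers and the feature construction are assumptions for illustration only.

```python
# Sketch with synthetic data: summarize replication scores and correlate
# them with a paper-level feature.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
scores = np.clip(rng.normal(0.42, 0.15, size=1000), 0.0, 1.0)    # predicted scores
n_pubs = rng.poisson(20, size=1000) + (scores * 30).astype(int)  # toy related feature

print("mean=%.2f median=%.2f sd=%.2f"
      % (scores.mean(), np.median(scores), scores.std()))
r, p = pearsonr(scores, n_pubs)
print("correlation with author publication count: r=%.2f (p=%.3g)" % (r, p))
```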


Figure 2 (Figure 1 in Youyou et al. (2023)): Replication score distribution for 14,126 papers.



Replicability assessment is essential to the academic world, whether for regulating research conduct or for selecting reliable literature. Manual replication and prediction markets have made great contributions to this field, but their costs are high and they are hard to scale, so less resource-intensive approaches are attractive. Machine-assisted approaches such as machine learning methods have recently been proposed, and we expect to see more useful methods and practical applications in this area.


References

Open Science Collaboration. "Estimating the reproducibility of psychological science." Science 349, no. 6251 (2015): aac4716.

Dreber, Anna, Thomas Pfeiffer, Johan Almenberg, Siri Isaksson, Brad Wilson, Yiling Chen, Brian A. Nosek, and Magnus Johannesson. "Using prediction markets to estimate the reproducibility of scientific research." Proceedings of the National Academy of Sciences 112, no. 50 (2015): 15343-15347.

Yang, Yang, Wu Youyou, and Brian Uzzi. "Estimating the deep replicability of scientific findings using human and artificial intelligence." Proceedings of the National Academy of Sciences 117, no. 20 (2020): 10762-10768.

Youyou, Wu, Yang Yang, and Brian Uzzi. "A discipline-wide investigation of the replicability of Psychology papers over the past two decades." Proceedings of the National Academy of Sciences 120, no. 6 (2023): e2208863120.


- Xin Wei


