
2024-12-31: Benchmark: Whether LLM agents can evaluate, replicate, and independently conduct the research process

I am excited to announce that Old Dominion University (ODU) is part of a multi-university grant awarded by the Open Philanthropy Foundation to support the development of a systematic benchmark assessing how effectively large language models (LLMs) can evaluate, replicate, and conduct scientific research. The lead institution is the Center for Open Science (CoS; Dr. Brian Nosek and Dr. Tim Errington), and the participating institutions are Pennsylvania State University (Dr. Sarah Rajtmajer, Dr. Qingyun Wu), the University of Notre Dame (Dr. Meng Jiang), and ODU (Dr. Jian Wu, myself).

The team will test whether LLMs are capable of determining whether claims are true or false. Here, a claim is a statement that conveys a research finding in a scientific paper. Our operationalization of this question is whether LLMs can assess a scientific paper and predict whether its primary findings would replicate or reproduce successfully in an independent test. During the funding period, we will evaluate whether LLMs and other methods are accurate and unbiased enough to serve as trustworthy information sources. We will increase the capacity of LLMs to reproduce and replicate claims. We will also develop guidelines and tools for how LLMs can complement expert judgment in scientific quality control.
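To make the prediction task concrete, here is a minimal sketch of how a single claim might be posed to an LLM as a binary replicate/not-replicate question. It assumes an OpenAI-style chat API; the model name, prompt wording, and one-word answer format are illustrative assumptions, not the benchmark's actual protocol.

```python
# Minimal sketch: ask an LLM whether a paper's primary finding would replicate.
# Model choice, prompt, and output format are illustrative, not the project's method.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def predict_replication(claim: str, methods_summary: str) -> str:
    """Return 'replicate' or 'not-replicate' as predicted by the model."""
    prompt = (
        "You are assessing a scientific paper.\n"
        f"Claim: {claim}\n"
        f"Methods summary: {methods_summary}\n"
        "Would an independent, well-powered replication likely confirm this claim? "
        "Answer with exactly one word: 'replicate' or 'not-replicate'."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

# Toy example (not a real paper's claim):
# print(predict_replication("Priming with money words reduces helpfulness.",
#                           "N = 40 undergraduates, between-subjects design."))
```

Predictions like this can then be scored against ground-truth replication outcomes to measure accuracy and systematic bias.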

Reproducibility and replicability (Goodman et al. 2016) are fundamental properties of research claims. This operationalization has the following virtues. First, it is hard but possible, even for humans. Second, it is feasible to obtain ground truth: the reproducibility and replicability of hundreds of papers in the social and behavioral sciences (SBS) have been studied in previous large-scale projects such as RPP (Reproducibility Project: Psychology), and the repliCATS project, part of the DARPA-funded SCORE project, provided expert-inferred replicability labels for at least 2,400 SBS papers. Third, it is possible to investigate systematic bias in false positives and false negatives given this ground truth and various versions of LLMs. Fourth, LLM performance can be compared to other algorithms and human-only methods. Automatic methods have been proposed to assess the reproducibility and replicability of SBS papers, including machine learning (Wu et al. 2021), deep learning (Yang et al. 2021), and prediction markets (Rajtmajer et al. 2022). The CoS and the Institute for Replication (I4R) have existing research, infrastructure, workflows, and financial support that this effort can leverage directly to accelerate progress on evaluating LLM capabilities.

The CoS and I4R have both developed the capability to organize hundreds of SBS researchers to conduct replication (same question, new data), robustness (same data, new analysis), and reproduction (same data, same analysis) studies, gathering ground-truth data that will serve as training and test data for algorithmic and LLM methods. PI Nosek and PI Errington have many highly cited publications on reproducibility and replicability (Nosek & Errington 2020, Nosek et al. 2022, Open Science Collaboration 2015).

Dr. Sarah Rajtmajer is an assistant professor in the College of Information Sciences and Technology and a research associate in the Rock Ethics Institute at Penn State. Her research integrates machine learning, AI, and hybrid human-AI systems to understand how information encodes values like accuracy, objectivity, and privacy, and the trade-offs involved in managing healthy information ecosystems. Dr. Rajtmajer is the PI of the SCORE project.

Dr. Qingyun Wu is an assistant professor in the College of Information Sciences and Technology at Penn State. Her research includes reinforcement learning, online optimization/learning/decision-making, automated machine learning, and their applications to practical data-driven learning systems. She is well known for her GitHub projects AutoGen (36.5k stars at the time of writing) and FLAML (4k stars at the time of writing).

Dr. Meng Jiang is an associate professor in the Department of Computer Science and Engineering at the University of Notre Dame. His research encompasses many important topics in data mining, machine learning, and natural language processing. In recent years, he has published several papers at ACL/EMNLP/AAAI conferences investigating the reasoning capabilities of LLMs.

I am an assistant professor of Computer Science at Old Dominion University. I participated in the SCORE project. My team built a pipeline that automatically extracted or retrieved more than 40 features that were later used for the prediction market model. My recent publications address the importance of open-access datasets and software for reproducibility and replicability in artificial intelligence (Ajayi et al. 2023; Salsabil et al. 2022). Recently, my team discovered a positive relationship between the aspect-based sentiment of citation contexts and the reproducibility of machine learning papers (Obadage et al. 2024).
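As an illustration of the feature-based approach (not the actual 40-feature pipeline or the prediction market model), the sketch below trains a simple classifier on a few hypothetical paper-level features with synthetic replication labels.

```python
# Illustrative only: a tiny feature-based replication classifier.
# The three features and the labels are synthetic, made up for demonstration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Hypothetical per-paper features: [log sample size, reported p-value, open-data flag]
X = np.column_stack([
    rng.normal(5.0, 1.0, 200),       # log sample size
    rng.uniform(0.001, 0.05, 200),   # reported p-value
    rng.integers(0, 2, 200),         # open data available?
])
# Synthetic ground-truth replication labels (1 = replicated)
y = (X[:, 0] + 2 * X[:, 2] - 60 * X[:, 1] + rng.normal(0, 1, 200) > 4.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```

In practice, the extracted features feed downstream models such as the prediction market described above rather than a single classifier.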

The project is expected to last two years, with a total budget of $1.7M.

-- Jian Wu 

