Figure 1: Overall process of applying conformal prediction for uncertainty quantification in LLMs (Figure 2 in paper)
Introduction
As large language models (LLMs) gain prominence in academia and industry, evaluating their performance becomes increasingly critical. Popular platforms such as the HuggingFace Open LLM Leaderboard focus solely on accuracy, neglecting an essential dimension: uncertainty. Uncertainty quantification (UQ) is vital for comprehensive model evaluation, as two models may achieve similar accuracy yet exhibit different levels of uncertainty, as illustrated in Figure 2 below. In this blog, we summarize the paper "Benchmarking LLMs via Uncertainty Quantification" [1] by Fanghua Ye et al., presented at NeurIPS 2024. The paper introduces a benchmarking framework that incorporates uncertainty quantification using conformal prediction [2], quantifying the uncertainties of nine LLM series across five core natural language processing (NLP) tasks and revealing valuable insights into the interplay between model size, accuracy, and uncertainty.
Figure 2: An illustration of two LLMs accurately predicting the true answer but showing different levels of uncertainty (Figure 1 in paper)
Methods: Conformal Prediction for LLM Uncertainty
Conformal prediction is a framework designed to quantify uncertainty in model predictions. It transforms model outputs into statistically reliable predictions by constructing a "prediction set" that is guaranteed to contain the correct answer with a specified confidence level. This paper employs conformal prediction as a model-agnostic, efficient, and statistically rigorous approach to UQ.
Unlike Bayesian methods or entropy-based measures, conformal prediction is distribution-free and relies on constructing prediction sets with a guaranteed coverage probability. These sets provide insight into model uncertainty through their size—smaller sets imply higher confidence and vice versa. Specifically, conformal prediction involves the following steps:
- Identify a heuristic notion of uncertainty: Start from the model's predicted probabilities over the answer options (e.g., softmax scores), which serve as a rough measure of confidence.
- Select a score function: Assign each prediction a conformal score, a value that quantifies how poorly the prediction aligns with the observed data; a higher score indicates less confidence in that prediction. The authors adopt the Least Ambiguous set-valued Classifier (LAC) [3] and Adaptive Prediction Sets (APS) score functions. LAC tends to produce the smallest prediction sets, while APS constructs sets that adapt to the difficulty of each instance.
- Calibrate a threshold: To ensure reliability, conformal prediction uses a small set of labeled data (the calibration set) that is separate from the test data. Conformal scores are computed for the true answers in this calibration set, and a threshold is set at an appropriate quantile of these scores. This threshold determines which predictions are included in the final prediction set.
- Construct the Prediction Set: For each new instance in the test set:
- The model generates conformal scores for all possible predictions.
- Predictions with scores at or below the threshold are included in the prediction set.
- The size of this prediction set reflects the model's uncertainty: smaller sets indicate higher confidence.
- Ensure coverage: The conformal prediction procedure guarantees that the prediction set contains the correct answer with at least a user-specified confidence level. In this paper, the authors use a 90% confidence level (error rate α = 0.1), as illustrated in the sketch after this list.
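To make these steps concrete, below is a minimal NumPy sketch of split conformal prediction with the LAC score on a toy multiple-choice task. The function names, probabilities, and calibration data are our own illustration rather than the paper's code; APS would replace `lac_scores` with a cumulative-probability score.

```python
import numpy as np

def lac_scores(probs, labels):
    """LAC conformal score: 1 minus the softmax probability of the true option."""
    return 1.0 - probs[np.arange(len(labels)), labels]

def calibrate_threshold(cal_probs, cal_labels, alpha=0.1):
    """Set the threshold from a held-out calibration set (target coverage 1 - alpha)."""
    n = len(cal_labels)
    scores = lac_scores(cal_probs, cal_labels)
    # Finite-sample corrected quantile level: ceil((n + 1) * (1 - alpha)) / n.
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, q_level, method="higher")

def prediction_sets(test_probs, threshold):
    """Include every answer option whose LAC score (1 - probability) is <= threshold."""
    return [np.where(1.0 - p <= threshold)[0].tolist() for p in test_probs]

# Toy example: 3 calibration questions and 2 test questions, 4 options (A-D) each.
cal_probs = np.array([[0.70, 0.15, 0.10, 0.05],
                      [0.40, 0.35, 0.15, 0.10],
                      [0.55, 0.25, 0.15, 0.05]])
cal_labels = np.array([0, 1, 0])                 # indices of the correct options
test_probs = np.array([[0.60, 0.20, 0.15, 0.05],
                       [0.40, 0.38, 0.12, 0.10]])

tau = calibrate_threshold(cal_probs, cal_labels, alpha=0.1)  # 90% target coverage
print(prediction_sets(test_probs, tau))  # [[0], [0, 1]]: the larger set signals less confidence
```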
Evaluation Tasks and Datasets
This paper quantifies the uncertainties of various LLMs across five core NLP tasks, each formulated as a multiple-choice question-answering (MCQA) problem. These tasks include the following:
- Question Answering (QA): Tests factual knowledge using the MMLU dataset, covering topics like STEM, humanities, and law.
- Reading Comprehension (RC): Assesses understanding of narratives and reasoning beyond text with the CosmosQA dataset.
- Commonsense Inference (CI): Evaluates reasoning about relationships and events using the HellaSwag dataset.
- Dialogue Response Selection (DRS): Tests conversational coherence with the HaluEval dataset (HaluDial).
- Document Summarization (DS): Measures summarization skills using the HaluEval dataset (HaluSum) derived from CNN/Daily Mail articles.
Evaluation Prompts and Metrics
The authors implemented three types of prompts to mitigate the LLMs' sensitivity to input phrasing (a small illustrative sketch follows this list):
- Base Prompt: Combines the question and answer options directly.
- Shared Instruction Prompt: Adds a general task description to guide the LLMs.
- Task-Specific Instruction Prompt: Provides detailed, task-specific guidance.
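As an illustration only, a hypothetical helper like the one below could assemble the three prompt styles for an MCQA instance; the instruction strings and function name are our own placeholders, not the paper's exact templates.

```python
def build_prompt(question, options, style="base",
                 shared_instruction="Answer the following multiple-choice question.",
                 task_instruction=None):
    """Assemble an MCQA prompt in one of the three styles (illustrative placeholders)."""
    letters = "ABCDEF"
    body = question + "\n" + "\n".join(
        f"{letters[i]}. {opt}" for i, opt in enumerate(options)
    ) + "\nAnswer:"

    if style == "base":
        return body
    if style == "shared":
        return shared_instruction + "\n\n" + body
    if style == "task_specific":
        return (task_instruction or shared_instruction) + "\n\n" + body
    raise ValueError(f"unknown prompt style: {style}")

print(build_prompt(
    "Which dataset is used for the reading comprehension task?",
    ["MMLU", "CosmosQA", "HellaSwag", "HaluEval"],
    style="task_specific",
    task_instruction="You are given a reading comprehension question. "
                     "Choose the single best option.",
))
```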
These prompting strategies help reduce the LLMs' sensitivity to prompt variability. To quantify uncertainty, the authors report three metrics: accuracy (ACC), average prediction set size (SS), and coverage rate (CR), the percentage of test instances whose prediction set contains the correct answer. To ensure fairness, results are averaged over the two conformal score functions (LAC and APS).
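These metrics follow directly from the prediction sets and the top-1 predictions. Here is a small sketch under our own naming, with a three-instance toy example:

```python
import numpy as np

def evaluate(pred_labels, pred_sets, true_labels):
    """Compute ACC, average set size (SS), and coverage rate (CR).

    pred_labels: the single highest-probability option per instance
    pred_sets:   list of option-index lists from conformal prediction
    true_labels: the correct option per instance
    """
    acc = np.mean(np.asarray(pred_labels) == np.asarray(true_labels))
    ss = np.mean([len(s) for s in pred_sets])
    cr = np.mean([t in s for s, t in zip(pred_sets, true_labels)])
    return {"ACC": acc, "SS": ss, "CR": cr}

# Toy example with three test instances and four options each.
metrics = evaluate(
    pred_labels=[0, 2, 1],
    pred_sets=[[0], [0, 2, 3], [1, 2]],
    true_labels=[0, 1, 2],
)
print(metrics)  # ACC ~ 0.33, SS = 2.0, CR ~ 0.67
```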
Evaluation Models
The study benchmarks a diverse set of nine open-source LLM series, representing a range of architectures, scales, and training methodologies. Specifically, the chosen models include the Llama-2 series (7B and 13B), Mistral-7B, Falcon-7B, MPT-7B, Gemma-7B, the Qwen series (7B and 14B), Yi-6B, DeepSeek-7B, and InternLM-7B. This selection allows for a comprehensive analysis of model performance in terms of both accuracy and uncertainty.
Comparison to Other Uncertainty Quantification Methods
The study also explores how conformal prediction compares to other UQ measures such as Shannon entropy, using the Expected Calibration Error (ECE) [4] as the yardstick. Entropy, often used in NLP, measures the uncertainty in the distribution of predicted probabilities. Although entropy-based methods are straightforward, they lack a direct connection to accuracy: permuting the predicted probabilities does not change the entropy, even though it can significantly affect prediction accuracy. To make entropy comparable to the prediction set size used in conformal prediction, the authors converted it into perplexity [5], which can be interpreted as the effective number of answer options.
ECE is a metric used to quantify the difference between predicted probabilities and the actual correctness of those predictions. It measures how well a model's confidence or softmax scores align with the true likelihood of being correct.
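For reference, both quantities can be computed from the predicted option probabilities roughly as follows; the binning scheme, toy data, and names are our own choices rather than the paper's implementation.

```python
import numpy as np

def perplexity_from_entropy(probs):
    """Exponentiated Shannon entropy: roughly the 'effective number of options'."""
    probs = np.clip(probs, 1e-12, 1.0)
    entropy = -np.sum(probs * np.log(probs), axis=-1)
    return np.exp(entropy)

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted gap between average confidence and accuracy per bin."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece

probs = np.array([[0.90, 0.05, 0.03, 0.02],    # confident prediction
                  [0.30, 0.28, 0.22, 0.20]])   # uncertain prediction
print(perplexity_from_entropy(probs))          # ~[1.53, 3.95]: higher = more uncertain
correct = [1, 0]                               # whether the top option was right
print(expected_calibration_error(probs.max(axis=1), correct))  # ~0.20 on this toy data
```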
Evaluation Setup
For each of the QA, RC, CI, DS, and DRS tasks, the dataset used to evaluate the LLMs contains 10,000 instances, 50% of which are used for calibration and the remaining 50% for testing. The confidence level for conformal prediction is set to 0.9 (error rate α = 0.1), meaning each prediction set should contain the correct answer with at least 90% probability.
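In code, this setup corresponds to something like the following sketch (the variable names and random split are our own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

n_instances = 10_000                       # instances available for one task
perm = rng.permutation(n_instances)
half = n_instances // 2
cal_idx, test_idx = perm[:half], perm[half:]   # 50% calibration, 50% test

alpha = 0.1                                # error rate; target coverage = 1 - alpha = 90%
```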
Evaluation Results
Based on the results shown in Table 1 below, Qwen-14B consistently achieves the highest accuracy, leading in the QA, RC, CI, and DRS tasks, while Yi-6B and Gemma-7B follow closely with strong performances. Regarding uncertainty, Gemma-7B exhibits the lowest average prediction set size, indicating higher confidence, with Qwen-14B and Mistral-7B also showing low uncertainty.
Interestingly, larger models such as Qwen-14B and Llama-2-13B tend to display greater uncertainty than smaller models like Yi-6B on the DS task, highlighting that increased model size does not always reduce uncertainty. Instruction-finetuned models generally exhibit higher uncertainty than their pretrained counterparts, as seen with Llama-2-7B.
In addition, most models meet the 90% coverage target; the lowest observed coverage rate is 89.56%, for Qwen-7B on the DS task. Moreover, higher accuracy does not always correlate with lower uncertainty: for example, InternLM-7B outperforms MPT-7B in accuracy on the DRS task but shows higher uncertainty.
Overall, the results emphasize the importance of evaluating both accuracy and uncertainty for a comprehensive understanding of LLM performance.
Table 1: The evaluation results of LLMs with sizes ranging from 6B to 14B (Table 1 in paper)
Comparing UQ Results Across Tasks
To demonstrate the advantages of conformal prediction, the authors conducted additional experiments comparing it with entropy and maximal predicted probability (P_max) in terms of ECE. Based on Table 2, conformal prediction achieves the lowest average ECE of 8.35%, indicating more reliable uncertainty estimates than entropy (8.46%) and P_max (8.61%).
Across individual tasks, conformal prediction matches or outperforms the other methods, particularly on the CI and DRS tasks. These results demonstrate that conformal prediction offers superior calibration, providing better uncertainty estimates than heuristic measures such as entropy and P_max.
Table 2: Comparison among conformal prediction (CP), entropy, and maximal predicted probability (P_max) using InternLM-7B (Table 3 in paper).
Conclusion
In this paper, the authors present a comprehensive benchmarking approach for evaluating LLMs by incorporating UQ via conformal prediction. Traditional evaluation methods, which focus solely on accuracy, overlook the critical aspect of model confidence. By assessing nine open-source LLM series across five key NLP tasks, the paper shows that higher accuracy does not always correlate with lower uncertainty and that larger models can exhibit greater uncertainty than smaller ones. The comparison of UQ methods further demonstrates that conformal prediction provides more reliable and better-calibrated uncertainty estimates than entropy and maximal predicted probability.
References
1. Ye, F., Yang, M., Pang, J., Wang, L., Wong, D. F., Yilmaz, E., Shi, S., & Tu, Z. (2024). Benchmarking LLMs via uncertainty quantification. Advances in Neural Information Processing Systems (NeurIPS 2024). https://doi.org/10.13140/RG.2.2.19298.71360
2. Angelopoulos, A. N., & Bates, S. (2023). A gentle introduction to conformal prediction and distribution-free uncertainty quantification. Foundations and Trends® in Machine Learning, 16(4), 494–591. https://doi.org/10.1561/2200000101
3. Sadinle, M., Lei, J., & Wasserman, L. (2018). Least Ambiguous Set-Valued Classifiers With Bounded Error Levels. Journal of the American Statistical Association, 114(525), 223–234. https://doi.org/10.1080/01621459.2017.1395341
4. Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning (pp. 1321–1330). https://doi.org/10.5555/3305381.3305518
5. Jelinek, F., Mercer, R.L., Bahl, L.R., & Baker, J.M. (1977). Perplexity—a measure of the difficulty of speech recognition tasks. Journal of the Acoustical Society of America, 62. https://doi.org/10.1121/1.2016299