
2024-06-26: Paper Summary: RELIC: Investigating Large Language Model Responses using Self-Consistency


Figure 1 Cheng et al.: RELIC allows users to search for information from large language models (A), view the model’s top response, and understand the variations between response samples to verify the correctness of the generated information (B). For long-form generated text, users can inspect the consistency of each individual claim (C) and find contradicting or supporting evidence from other samples (D). Steps (1-6) illustrate the user’s verification process of InstructGPT’s response regarding Don Featherstone. (Figure 1 in original paper)

Large Language Models (LLMs) have revolutionized natural language processing, achieving remarkable performance across various tasks such as translation, summarization, and question answering. Despite their success, understanding and evaluating their responses remains a significant challenge. In the paper "RELIC: Investigating Large Language Model Responses using Self-Consistency," published at CHI 2024 by Furui Cheng, Vilém Zouhar, Simran Arora, Mrinmaya Sachan, Hendrik Strobelt, and Mennatallah El-Assady, the authors present RELIC, a novel framework designed to investigate and improve the self-consistency of LLM outputs.

Self-consistency refers to the ability of a model to produce stable and coherent responses when queried multiple times with semantically equivalent prompts. Ensuring self-consistency is crucial for reliable deployment in real-world applications where accuracy and dependability are paramount. However, current methods for assessing LLMs often overlook this aspect, focusing primarily on single-instance evaluations that may not capture the variability in model behavior. RELIC addresses this gap by systematically analyzing LLM responses through repeated sampling and consistency checks. By examining variations in outputs and identifying patterns of inconsistency, RELIC provides deeper insights into model behavior, offering a robust metric for evaluation beyond traditional benchmarks. This approach improves both our understanding of LLM performance and the processes of model training and fine-tuning. RELIC was applied to a range of LLMs to demonstrate its efficacy in detecting and addressing inconsistencies. The authors' findings reveal critical insights into the strengths and weaknesses of current models, highlighting areas for future research and development. Through RELIC, the authors advanced the field of natural language processing by promoting more reliable and interpretable model responses.
Methodology
In the quest to enhance user experience and model reliability, the authors developed a mixed-initiative system that integrates a novel self-consistency-checking pipeline with a user-friendly visual interface. This approach addresses the challenge of interpreting model confidence, which traditional token-wise log probability measures fail to convey effectively to end-users. The study leverages the concept of semantic uncertainty, which evaluates the text as a whole and assesses semantic variations across different model-generated responses. By sampling multiple responses to the same prompt and analyzing their logical entailments, the system can gauge the model's confidence level: the more samples that support the focal response, the higher the confidence.

To improve scalability and provide detailed explanations for confidence assessments, the researchers propose a computational pipeline designed to handle long-form text and offer granular insights into the model's judgments. Moreover, a visual interface allows users to interactively explore the self-consistency information. It presents different levels of abstraction, enabling users to drill down into specific claims, explore alternatives, and locate supporting evidence within the sampled responses. This approach not only enhances the interpretability of language models but also empowers users to better trust and understand the generated content.
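To make the entailment-counting idea concrete, here is a minimal sketch (not the authors' implementation): an off-the-shelf NLI model checks how many sampled responses support the focal (top) response, and the fraction of supporting samples serves as a confidence score. The checkpoint name and the example samples are illustrative assumptions.

```python
# Hedged sketch of entailment-based self-consistency (not the RELIC source code).
# The NLI checkpoint and the example samples below are illustrative assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NLI_CHECKPOINT = "microsoft/deberta-large-mnli"  # an off-the-shelf DeBERTa NLI model
tokenizer = AutoTokenizer.from_pretrained(NLI_CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(NLI_CHECKPOINT)

def entails(premise: str, hypothesis: str) -> bool:
    """True if the NLI model predicts that the premise entails the hypothesis."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    label = model.config.id2label[int(logits.argmax(dim=-1))]
    return label.lower() == "entailment"

def response_confidence(focal_response: str, other_samples: list[str]) -> float:
    """Fraction of the other sampled responses that entail the focal response."""
    if not other_samples:
        return 0.0
    support = sum(entails(sample, focal_response) for sample in other_samples)
    return support / len(other_samples)

# Hypothetical samples from repeated calls to the same prompt about Don Featherstone.
samples = [
    "Don Featherstone created the plastic pink flamingo in 1957.",
    "The plastic pink flamingo lawn ornament was designed by Don Featherstone.",
    "Don Featherstone was an American sculptor.",
]
print(response_confidence(samples[0], samples[1:]))
```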

Algorithm Design - Self-Consistency Check


Figure 2 Cheng et al.: Pipeline of computing the self-consistency of individual claims (Algorithm 1). The input is a user prompt with which an LLM is invoked multiple times. Afterwards, the atomic claims are generated based on the top text response, then turned into questions and answered by all other generations. The answers are then clustered together based on their meaning (e.g., Spanish and from Spain) (Figure 2 in original paper).

The paper presents a method that quantifies semantic-level confidence in natural language generations (NLG), which is crucial for determining the reliability of generated text. The method not only measures confidence but also ties those confidence levels to tangible evidence, aiding users in evaluating the trustworthiness of generated content. The core of the approach is encapsulated in Algorithm 1, which outlines the procedural steps and their justifications. The authors also provide a thorough quantitative analysis of the algorithm's effectiveness in identifying and mitigating hallucinations in NLG.


Algorithm 1 Cheng et al. (Algorithm 1 in original paper).
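The control flow of Algorithm 1 can be summarized in a short structural sketch. Under the assumption that claim extraction, question generation, answering, and the semantic-equivalence check are supplied by the models described in the paper, the hypothetical function below groups the answers drawn from the other samples into meaning-based clusters for each atomic claim.

```python
# Structural sketch of the claim-level pipeline in Algorithm 1 (not the authors' code).
# The four callables are assumed to be backed by the LLM prompts, QA model, and NLI
# model described in the paper; here they are plain parameters so the flow stands alone.
from collections import defaultdict
from typing import Callable

def claim_consistency(
    top_response: str,
    other_samples: list[str],
    extract_claims: Callable[[str], list[str]],   # atomic claim extraction (LLM prompt)
    make_question: Callable[[str], str],          # question generation (LLM prompt)
    answer: Callable[[str, str], str],            # QA model: (question, context) -> answer
    same_meaning: Callable[[str, str], bool],     # semantic equivalence (NLI-style check)
) -> dict[str, dict[str, int]]:
    """For each atomic claim, cluster the answers produced from the other samples."""
    report: dict[str, dict[str, int]] = {}
    for claim in extract_claims(top_response):
        question = make_question(claim)
        clusters: dict[str, int] = defaultdict(int)
        for sample in other_samples:
            candidate = answer(question, sample)
            # Greedy clustering: join the first cluster whose representative answer
            # means the same thing (e.g., "Spanish" and "from Spain").
            for representative in clusters:
                if same_meaning(representative, candidate):
                    clusters[representative] += 1
                    break
            else:
                clusters[candidate] += 1
        report[claim] = dict(clusters)
    return report
```

In this view, a claim whose largest answer cluster covers most of the samples is well supported, while a claim whose answers scatter across many small clusters signals a likely hallucination.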

The RELIC System

RELIC is an interactive system designed to help users of LLMs navigate and correct inaccuracies in text generation. It has four main components:

  • Prompt Inputter: Where users input their prompts.
  • Response View: Displays the model's top responses.
  • Claim View: Allows inspection of individual claims.
  • Evidence View: Shows supporting or contradicting evidence for the claims.
The process begins with users entering a prompt and reviewing the model's responses. The top response is evaluated using keyword annotations to gauge content quality. If deemed unsatisfactory, the content is rejected. Otherwise, users delve deeper, examining claims and their evidence. Unreliable claims are edited out, and the process repeats until users are satisfied with the accuracy of the information.


Figure 3 Cheng et al.: The Keyword Annotation uses a small word-scale visualization (left) to display the proportions of different categories of samples or alternatives. Upon clicking, users are able to view a list of alternatives and inspect their details (right) (Figure 3 in original paper).

The interface is crafted using React and D3.js, while the server relies on Flask, with spaCy handling text tokenization and segmentation. At the heart of RELIC lies a computational pipeline built on three core NLP tasks: natural language inference (NLI), question generation, and question answering (QA), along with a novel task the authors call atomic claim extraction. The system uses a DeBERTa model fine-tuned for NLI to assess logical connections in text and a distilled RoBERTa model for QA; both are sourced from HuggingFace's open-source library with standard settings. For atomic claim extraction, RELIC employs OpenAI's text-davinci-003 model, sampled at a specified temperature setting to produce nuanced and contextually relevant outputs. Question generation is handled by the same model, albeit with different prompts.
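As a rough illustration of how these off-the-shelf pieces fit together, the snippet below loads a spaCy pipeline for sentence segmentation and a HuggingFace question-answering pipeline; the specific checkpoints are stand-ins, not necessarily the ones used in RELIC.

```python
# Illustrative wiring of the off-the-shelf components mentioned above (assumed
# checkpoints; the paper's exact model choices and settings may differ).
import spacy
from transformers import pipeline

nlp = spacy.load("en_core_web_sm")  # tokenization and sentence segmentation
qa = pipeline(
    "question-answering",
    model="deepset/tinyroberta-squad2",  # a distilled RoBERTa QA model on HuggingFace
)

def split_sentences(text: str) -> list[str]:
    """Segment a generated response into sentences before claim extraction."""
    return [sent.text.strip() for sent in nlp(text).sents]

def answer_from_sample(question: str, sample_text: str) -> str:
    """Answer a claim-derived question against one of the other sampled responses."""
    return qa(question=question, context=sample_text)["answer"]
```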

Result Analysis 

The results are summarized based on the insights about the users’ workflow on verifying and correcting the generated text, the comparisons with traditional LLM interfaces, the system’s usability and usefulness, and the participants’ desired improvements.

Figure 4 Cheng et al.: Questionnaire results. Overall, the 10 participants reported high usefulness and satisfaction (Example 5 in original paper).


Verification Strategies: Participants unanimously valued the number of supporting samples as a key indicator for justifying information. Differing approaches were noted, especially when dealing with unsupported yet uncontradicted information. Some opted for removal if unsupported, while others considered context and seriousness before deciding. Categorizing samples into support, neutral, and contradiction was found to be beneficial for understanding variations.

Correction Tactics: Participants engaged in multiple edits, assessing changes for better accuracy. Unsupported claims were either improved or deleted if alternatives failed. Some employed a strategy of intentionally generalizing details to avoid specific inaccuracies.

Comparative User Experience: Compared to traditional LLM interfaces, RELIC made alternative responses more accessible and less mentally taxing. Participants appreciated the structured interaction with LLM outputs, which allowed for a more analytical editing process. This summary reflects the effectiveness of RELIC in aiding users to discern and refine the accuracy of generated text, enhancing their confidence in the content.

Usability Insights: Participants rated the system as easy to use and learn, with scores around 5.70. Enjoyment in using RELIC was high, scoring 5.90 on a 7-point Likert scale. Confidence in using the system varied, averaging 4.60, suggesting a learning curve with the novel system.

Desired Improvements: Participants desired the ability to access evidence from external resources, not just the model's samples. A more flexible approach to question posing and evidence retrieval was suggested. Intelligent text editing interactions that balance information richness and confidence were also requested.


Conclusion 

The research successfully tackles the complex issue of detecting nonfactual outputs from LLMs. It introduces a shift from analyzing token-level probabilities to evaluating claim-level confidences. This is achieved by measuring self-consistency across multiple text generations, which also facilitates the presentation of alternative keywords within the generated content. A user study validates the system's ease of use and practicality. The key innovation lies in using self-consistency as an indicator of model confidence, empowering users to spot and correct potential inaccuracies. This promotes a user-centric approach to ensuring factual content generation by LLMs. The proposed pipeline is user-friendly and serves as a safeguard against the acceptance of misleading information, often presented with deceptive fluency by LLMs. The conclusion also opens a dialogue on the broader implications of the study, its real-world applicability, inherent limitations, and the potential for future enhancements to the system.


References


Furui Cheng, Vilém Zouhar, Simran Arora, Mrinmaya Sachan, Hendrik Strobelt, and Mennatallah El-Assady. "RELIC: Investigating Large Language Model Responses using Self-Consistency." In CHI '24: Proceedings of the CHI Conference on Human Factors in Computing Systems. DOI: 10.1145/3613904.3641904


-Nithiya


