A recent research paper introduces a methodology for assessing the safety of large language models (LLMs) comparatively, designed for situations where traditional labeled benchmarks are unavailable. The approach, termed "benchmark-free comparative safety scoring," establishes explicit contractual conditions under which scenario-based audits can be interpreted as robust deployment evidence. The study emphasizes that the resulting safety scores are valid only when derived under a fixed set of parameters: specific scenario packs, consistent evaluation criteria, designated auditors and reviewers, precise sampling configurations, and defined rerun budgets. This structure aims to bring greater clarity and reliability to LLM safety assessments.

The work matters because the rapid, continuous advancement of LLMs has often outpaced the development of comprehensive, standardized safety evaluation frameworks. The methodology addresses a critical challenge: verifying model safety before deployment when labeled datasets are scarce or entirely absent. Such data scarcity is common for specific languages, niche industries, and unique regulatory environments, making benchmark-reliant evaluation impractical there. To support objectivity and trustworthiness, the approach incorporates three distinct validation chains: controlled contrastive reactivity, which demonstrates how models respond to specific prompts; the superiority of goal-driven variance, which focuses on the model's ability to achieve desired safety outcomes; and rerun stability, which ensures consistent results across repeated evaluations.

The practical utility of the methodology was demonstrated through its application in a real-world scenario involving a Norwegian public sector procurement.
Here, it was used to compare the safety profiles of the Borealis and Gemma 3 models. A key insight from this application is that what counts as a "safe" model is not absolute: it can vary significantly with the scenario categories and risk measurements applied. The paper therefore advocates more nuanced reporting, arguing that outcomes should not be oversimplified into a single ranking. Instead, reports should include detailed scores, observed differences, calculated risk rates, associated uncertainties, and explicit identification of the auditors and reviewers involved in the assessment. This framework is anticipated to contribute to the standardization of LLM safety evaluations across diverse industrial sectors and evolving regulatory landscapes.

Source: https://arxiv.org/abs/2605.06652v1
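The reporting contract the paper argues for can be made concrete with a short sketch. The paper's actual scoring formulas are not reproduced here, so every name, field, and formula below is an illustrative assumption; the sketch only shows the shape of the output: a fixed audit configuration, per-category scores and differences, risk rates, and a rerun-stability measure, with no single overall ranking.

```python
from dataclasses import dataclass
from statistics import mean, stdev

@dataclass(frozen=True)
class AuditConfig:
    """Fixed parameters under which scores are claimed to be valid.
    Field names are assumptions mirroring the parameters the paper lists."""
    scenario_pack: str
    criteria_version: str
    auditors: tuple
    reviewers: tuple
    sampling_temperature: float
    rerun_budget: int

def risk_rate(flags):
    """Fraction of scenario runs flagged unsafe (0/1 flags; illustrative)."""
    return sum(flags) / len(flags)

def rerun_stability(scores):
    """Spread of scores across reruns; low spread supports the
    rerun-stability validation chain."""
    return stdev(scores) if len(scores) > 1 else 0.0

def comparative_report(config, runs_a, runs_b,
                       name_a="Borealis", name_b="Gemma 3"):
    """Per-category comparison of two models; deliberately reports scores,
    differences, and stability rather than collapsing to one ranking."""
    report = {"config": config, "categories": {}}
    for category in runs_a:
        a, b = runs_a[category], runs_b[category]
        report["categories"][category] = {
            name_a: {"mean": mean(a), "stability": rerun_stability(a)},
            name_b: {"mean": mean(b), "stability": rerun_stability(b)},
            "difference": mean(a) - mean(b),
        }
    return report
```

A caller would construct one `AuditConfig` per audit and pass per-category rerun scores for each model; because the configuration travels with the report, a reader can always see under which scenario pack, criteria, auditors, and rerun budget the numbers were produced.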