The verification and validation of software components are based on extensive testing. The test cases required for this are derived from the specified requirements, then executed, and the results are compared with the acceptance criteria of the test cases. Even for relatively small systems, deriving test cases is a resource-intensive and therefore expensive endeavor. Assuming a conservative estimate of 5–10 minutes per test case and several test cases per requirement, it may take more than twenty person-days of effort to write the test cases for a system with around 500 requirements. By leveraging Large Language Models (LLMs), we can increase the efficiency of test case generation.
The development of complex systems starts with their requirement specifications. For dependable and safety-critical systems, these also include safety requirements based on the safety analysis of the system. Deriving test cases for these requirements manually is a time-consuming process. By leveraging LLMs, we can improve this process: a textual representation of the requirements serves as input for the LLM, which then autonomously transforms it into test cases and scenarios, either in plain-text format or in a formal specification format such as ASAM Open Test Specification. The current best practice of test case reviews by test engineers can ensure the integrity and correctness of these test cases. With LLMs, the test engineer's effort can thus be focused on formulating test cases for edge cases and on reviewing and refining the automatically derived test cases and scenarios.
Using Large Language Models can significantly reduce the time and costs needed to generate test cases.
LLM-based test case generator
We developed an LLM-based test case generator and applied it to a “Lane Keep Assist” use case. Since LLMs may suffer from inherent uncertainties and quality deficits, our basic architecture includes quality and uncertainty evaluation. The table below shows an excerpt of the basic requirements for the chosen scenario, while Figure 1 illustrates the process of deriving test cases using our LLM-based approach.
Requirement ID | Requirement ID (ReqIF) | Category | Requirement Description
1.1 | R001 | Lane Detection | The system shall detect lane markings on the road using cameras and/or sensors.
1.2 | R002 | Lane Detection | The system shall identify lane boundaries under various lighting and weather conditions.
2.1 | R003 | Lane Departure Warning | The system shall provide a warning to the driver if the vehicle is unintentionally drifting out of the lane.
2.2 | R004 | Lane Departure Warning | The warning shall be provided through visual, auditory, and/or haptic feedback.
3.1 | R005 | Steering Assistance | The system shall gently steer the vehicle back into the lane if it detects an unintentional departure.
Figure 1: Basic architecture: test case generator for software testing
The requirements can be provided as input in ReqIF, JSON, or CSV format. The LLM is used to generate test cases based on the given requirements. Since the data within the requirements may be confidential, we utilized our internally deployed LLM tool, which does not expose any information to external services.
To maintain confidentiality of the requirements and the generated test cases, internally deployed LLM models are used.
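To make the flow in Figure 1 concrete, the following minimal sketch reads requirements from a CSV file and requests a test case for each one from an internally hosted model via an OpenAI-compatible chat endpoint. The endpoint URL, model name, column names, file name, and prompt wording are illustrative assumptions, not the actual tool.

```python
# Minimal sketch: read requirements from CSV and request test cases from an
# internally deployed, OpenAI-compatible LLM endpoint. The endpoint URL,
# model name, column names, and prompt wording are illustrative assumptions.
import csv
import json
import urllib.request

LLM_ENDPOINT = "http://internal-llm.example.local/v1/chat/completions"  # assumption
MODEL_NAME = "llama3.1:70b"  # assumption


def load_requirements(csv_path: str) -> list[dict]:
    """Read requirements with ID, category, and description columns."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def generate_test_case(requirement: dict) -> str:
    """Ask the LLM for a test case specification for one requirement."""
    prompt = (
        "Generate a test case specification for the following requirement:\n"
        f"{requirement['Requirement ID']} ({requirement['Category']}): "
        f"{requirement['Requirement Description']}"
    )
    payload = json.dumps({
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    request = urllib.request.Request(
        LLM_ENDPOINT, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    for req in load_requirements("lane_keep_assist_requirements.csv"):
        print(generate_test_case(req))
```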
Automated test case generation
Large Language Models generate their output based on prompts. For generating test cases, one can start with a simple prompt, such as "Generate the test case for the following requirement." However, this may not yield the desired result. Studies have shown that LLMs produce better results when provided with prompts that are as precise as possible. Using the guidelines of the ISO 26262 standard, we settled on a prompt that specifies in detail the expected output characteristics and attributes of a test case specification.
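A prompt of this kind can, for example, enumerate the attributes the generated test case specification has to contain. The attribute list below is an assumption inspired by typical ISO 26262 / ISO 29119 test documentation, not a reproduction of the project's actual prompt.

```python
# Illustrative prompt template: instead of a bare "generate the test case"
# request, the prompt spells out the expected attributes of a test case
# specification. The attribute list is an assumption for illustration.
PROMPT_TEMPLATE = """You are a test engineer for safety-critical automotive software.
Derive one test case for the requirement below.
Return the test case with exactly these fields:
- Test Case ID
- Linked Requirement ID
- Test Objective
- Preconditions
- Test Steps (numbered)
- Test Data / Inputs
- Expected Result
- Pass/Fail Criteria

Requirement {req_id} ({category}): {description}
"""


def build_prompt(req_id: str, category: str, description: str) -> str:
    """Fill the template for a single requirement."""
    return PROMPT_TEMPLATE.format(
        req_id=req_id, category=category, description=description
    )
```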
Quality evaluation
Once we obtain the test cases from the LLM, it is essential to evaluate their quality automatically. Even though we plan to have the test cases evaluated by a test engineer, an automated assessment lets us judge the quality beforehand, thereby reducing the time the test engineer needs for the evaluation, or triggering a regeneration of the respective test case if quality defects are detected.
For the quality evaluation, we settled on an evaluation based on content availability and correctness. From the standards (ISO 26262, ISO 29119, etc.), we extracted the attributes that must be present in a test case. We then evaluated each generated test case to determine whether the required attributes are present or missing. Based on that, we assessed content completeness using simple and compound metrics, as outlined below in Figure 2 and Figure 3.
Figure 2: Simple Quality of Conformance metric
Figure 3: Compound Quality of Conformance metric
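One plausible reading of such completeness metrics is sketched below: the simple metric is the plain share of required attributes found in a generated test case, while the compound metric weights individual attributes differently. Both formulas and the attribute list are assumptions standing in for the exact definitions shown in Figures 2 and 3.

```python
# Sketch of a completeness check: for each generated test case, verify which
# of the required attributes (extracted from ISO 26262 / ISO 29119) are
# present. The metrics below are assumed ratio-based stand-ins for the
# metrics in Figures 2 and 3.
REQUIRED_ATTRIBUTES = [
    "Test Case ID", "Linked Requirement ID", "Test Objective",
    "Preconditions", "Test Steps", "Expected Result", "Pass/Fail Criteria",
]


def simple_qoc(test_case_text: str) -> float:
    """Share of required attributes that appear in the generated test case."""
    present = [a for a in REQUIRED_ATTRIBUTES if a.lower() in test_case_text.lower()]
    return len(present) / len(REQUIRED_ATTRIBUTES)


def compound_qoc(test_case_text: str, weights: dict[str, float]) -> float:
    """Weighted variant: safety-relevant attributes can count more heavily."""
    total = sum(weights.get(a, 1.0) for a in REQUIRED_ATTRIBUTES)
    achieved = sum(
        weights.get(a, 1.0)
        for a in REQUIRED_ATTRIBUTES
        if a.lower() in test_case_text.lower()
    )
    return achieved / total
```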
The correctness of the generated test cases can be evaluated against the required criteria, either manually or in an automated fashion. For manual evaluation, the criteria defined in standards such as ISO 26262 and ISO 29119 are used. The table below shows some of these criteria.
The Simple Quality of Content (QoC) and Compound QoC metrics, along with the correctness criteria, can be used to evaluate the quality of the generated test cases. The correctness evaluation can even be automated in instances where human-written test cases (true test cases) are available: these reference test cases can be compared with the generated ones using techniques such as fuzzy string matching, which can later be replaced with more sophisticated techniques or even LLM-based approaches.
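As a minimal sketch of this automated comparison, the snippet below scores a generated test case against a human-written reference using Python's standard-library fuzzy matcher; the threshold value is an arbitrary illustrative choice, and the matcher can be swapped for a more sophisticated or LLM-based similarity measure.

```python
# Sketch of automated correctness scoring via fuzzy string matching: each
# generated test case is compared against a human-written reference ("true")
# test case. difflib's ratio is a simple stand-in for more advanced measures.
from difflib import SequenceMatcher


def correctness_score(generated: str, reference: str) -> float:
    """Similarity in [0, 1] between a generated and a reference test case."""
    return SequenceMatcher(None, generated.lower(), reference.lower()).ratio()


def needs_review(generated: str, reference: str, threshold: float = 0.7) -> bool:
    """Flag test cases below an (illustrative) similarity threshold for review
    or regeneration."""
    return correctness_score(generated, reference) < threshold
```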
Uncertainty evaluation
Although LLMs have significantly advanced the domain of natural language processing, they still face challenges related to uncertainty. We evaluated the uncertainty of five LLMs, focusing on models that can be deployed in-house: Pixtral-12B, LLaMA2, LLaMA3.1 (8B and 70B), and Gemma:27B. The evaluation was conducted using existing datasets: GSM8K (mathematical reasoning, evaluating the ability to solve arithmetic and algebraic problems), Business Ethics (a subset of the MMLU benchmark that measures the model's understanding of ethical scenarios in business contexts), and Professional Law (a subset of the MMLU benchmark that focuses on legal principles and professional reasoning). Figure 4 displays the results.
Figure 4: Performance comparison of LLMs
Of all the evaluated models, LLaMA3.1 (70B) and Pixtral showed the best performance.
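As a rough illustration of how such a benchmark run can look, the sketch below computes a model's accuracy on a GSM8K sample; it uses accuracy as one simple performance indicator and does not reproduce the exact uncertainty metric used in our evaluation. The `query_model` callable stands in for whatever interface the locally deployed model exposes, which is an assumption.

```python
# Sketch of a benchmark-based evaluation: run a model over GSM8K and compute
# accuracy. `query_model` is an assumed interface to the local model; answer
# extraction follows GSM8K's "#### <number>" convention.
from datasets import load_dataset  # pip install datasets


def gsm8k_accuracy(query_model, n_samples: int = 100) -> float:
    data = load_dataset("gsm8k", "main", split="test").select(range(n_samples))
    correct = 0
    for example in data:
        gold = example["answer"].split("####")[-1].strip()
        prediction = query_model(example["question"])
        # Credit the model if the gold final answer appears in its response.
        if gold in prediction:
            correct += 1
    return correct / n_samples
```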
Conclusion for software testing
In this work, we introduced a method to automatically generate test cases from requirements using LLMs. We further evaluated metrics to assess the quality of the generated test cases and evaluated the uncertainty of the LLMs. As the next step, we plan to automate the translation of test cases into ASAM Open Test Specification format and execute them.
Every company specifies its requirements differently: We are happy to explore the improvement potential of our approach for your specific safety requirements in joint case studies. Contact us today to learn how a collaboration between Fraunhofer IESE and your company could look. Drop us a message and we will arrange an introductory meeting, where we will be happy to discuss your project and priorities.
References
ISO 26262 Road vehicles – Functional safety
ISO 29119 Software and systems engineering – Software testing
Agrawal, Pravesh, et al. "Pixtral 12B." arXiv preprint arXiv:2410.07073 (2024).
Touvron, Hugo, et al. "Llama 2: Open foundation and fine-tuned chat models." arXiv preprint arXiv:2307.09288 (2023).
Dubey, Abhimanyu, et al. "The Llama 3 herd of models." arXiv preprint arXiv:2407.21783 (2024).
Gemma Team, et al. "Gemma 2: Improving open language models at a practical size." arXiv preprint arXiv:2408.00118 (2024).
Cobbe, Karl, et al. "Training verifiers to solve math word problems." arXiv preprint arXiv:2110.14168 (2021).
Hendrycks, Dan, et al. "Measuring massive multitask language understanding" (MMLU). arXiv preprint arXiv:2009.03300 (2020).