Testing Gen AI Applications
A Gen AI application is any application that leverages Large Language Models and has user interface to get user inputs to act upon. For this article we would not consider code generation or other applications that are primarily aimed towards enhancing productivity and have the output vetted by developer or actor before using.
Gen AI application could range from a simple summarization application to RAG based Semantic Search to a multi agentic application. As Organizations race to create Gen AI applications, there will be need of a testing function to certify those applications to be fit for production deployment. Due to inherent limitations of Large Language Models, business would look upon human testers to sign off these applications before confidently putting into production.
However, testing of Gen AI Applications is different from normal IT applications testing. There are a lot of aspects that need to be considered to ensure that production application would not go rogue. When it comes to Gen AI Applications testing, the boundary between functional and non-functional testing gets blurred. Also, testing should cover the application against underlying LLM vulnerabilities.
Following are the key considerations that any tester needs to account while testing a Gen AI application –
- Does the Gen AI Application handle bad user behaviours such as —
a. Entering malicious inputs.
b. Engaging in prompt hacking (Jailbreak or prompt leakage or prompt injection)
c. User trying to use the application beyond its intended purpose (including not supporting banned topics within organization).
2. Does the Gen AI application draw clear boundary of the scope and gracefully respond whenever its scope is breached?
3. Does the Gen AI application gracefully turn down the request if it is unable to provide correct response (instead of hallucinating)?
4. Does the Gen AI Application protect sensitive information (it should not be sharing and storing sensitive data)?
5. Does the Gen AI Application ground its response and substantiate its responses by providing reasonable explanation?
6. Does the Gen AI Application also ensure that responses are coherent, fluent and safe (non-abusive)?
7. Does the Gen AI Application respond to a wide variation of inputs in natural languages as long as it is able to understand the intent (and extract required information/entities) from user inputs. It should not be a stickler for grammar and structure and should rather work on semantics?
8. Does the Gen AI Application support asking same question/request in variety of ways ?
9. For the actions that Agentic Gen AI Application will take on behalf of user, does it provide enough opportunity for the user to confirm before the actual action. Essentially there should be Human in the loop feedback mechanism built in when the Agent is going to take some actions which can have either financial impacts or alter the state of the system)?
10. For the Agentic Gen AI Application that implement workflow (orchestrating multiple system calls) to fulfil the request, the agent should share intermediate steps as well as inputs provided while calling external applications clearly. It should clearly explain how the end response was formulated to give enough information/confidence to the user to accept/reject the response?
11. If the Gen AI Application deploys cache mechanism, the separate tests should be done to clearly check when the response is being received from cache and when it is not. Also, how cache behaves across users, contexts and duration should be checked. How it should be ensured that latest information is being reflected while responding to user queries instead of servicing always from cache. Purging of stale cache should be tested?
12. If the Gen AI Application gives option to select LLM(s) to user then all the above should be checked specifically for every LLM?
Lastly while cost aspects are normally not covered but in Gen AI application; tester should also check the cost aspects as well.
- Identify what are the fixed costs — that application will incur while just being up (for e.g. hosting a 3rd party LLM on GPU enabled VMs incur fixed cost irrespective of its usage) ?
- Identify what is the average cost per call and extrapolate application running costs while in production?
Given the above considerations, it is important to understand how a Gen AI tester differs from normal IT application tester. Below are the key skillsets of a Gen AI tester.
- Understand Prompt Engineering
2. High Level understanding of how LLM works and their vulnerabilities
3. Ability to Validate with Unstructured Data
4. Well verse with natural language semantics.
5. Functional knowledge of the application.
6. Critical Thinking
7. Non-Functional Aspects such as Ethical AI, in addition to Security, Performance, Throughput.
8. Understanding of Operational Costs