Performance Testing for Gen AI Applications
We recently encountered a significant cost issue with an industry-leading Large Language Model (LLM) consumed as a service. Over a period of three months, the expenses for invoking the LLM exceeded 45,000 USD. This figure excludes the infrastructure and other costs of hosting the application that was making these calls.
Upon investigation, we discovered that the application, which had not yet been deployed to production, had exchanged approximately 28 billion tokens (input and output) with the LLM in question during this period. Why was an application still in the testing phase processing such a high volume of tokens? The answer: it was undergoing PERFORMANCE TESTING.
Performance testing is a critical requirement before deploying applications to production. However, conducting performance testing on Generative AI applications that utilize external Large Language Models (LLMs) as a service can be prohibitively expensive. This raises the question: how do we effectively load test Generative AI applications?
It is essential to develop distinct strategies for load testing based on whether the Generative AI application is calling LLMs offered as a service (e.g., Gemini from Google or OpenAI models like ChatGPT) or LLMs deployed on virtual machines within cloud subscriptions or projects.
A typical Gen AI application is composed of a handful of cooperating components.
Let’s break down the key components of a Generative AI application with an example. Suppose we have developed an application to verify whether an insurance claim is valid according to the policy.
- User Interface: This is the screen where the user enters their personal details, describes the claim in free text, and uploads the policy document.
- Orchestrator or Agent: This component receives user inputs and coordinates with the other components to process the claim. It interacts with third-party/enterprise APIs, the business logic library, Large Language Models, and the database (a minimal sketch of this flow follows the list).
- Enterprise/Third-Party API: This API checks if the user has any past claims and identifies any rejected claims with reasons to prevent potential fraud. In the case of a successful claim, it can also initiate the payment transfer to the user’s bank account.
- Business Logic: This code validates the insurance period, handles potential fraud cases, and verifies all user-provided inputs.
- Large Language Model: This model uses the insurance document as context and the user’s claim in natural language to determine if the claim is covered by the policy.
- Database: This is where the claim information, approval/rejection decision, and reimbursement status are recorded.
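To make the flow concrete, here is a minimal Python sketch of the orchestrator coordinating these components. All class and method names (`Claim`, `claims_api.get_claim_history`, `llm.invoke`, and so on) are hypothetical stand-ins, not taken from any specific framework.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    user_id: str
    description: str          # free-text claim entered in the UI
    policy_document: str      # uploaded policy document text

class Orchestrator:
    def __init__(self, claims_api, business_rules, llm, database):
        self.claims_api = claims_api          # enterprise/third-party API
        self.business_rules = business_rules  # business logic library
        self.llm = llm                        # large language model client
        self.database = database

    def process(self, claim: Claim) -> str:
        # 1. Check claim history for prior rejections / potential fraud.
        history = self.claims_api.get_claim_history(claim.user_id)

        # 2. Validate the insurance period and user-provided inputs.
        if not self.business_rules.is_valid(claim, history):
            decision = "rejected"
        else:
            # 3. Ask the LLM whether the policy covers the described claim,
            #    using the uploaded policy document as context.
            prompt = (
                f"Policy:\n{claim.policy_document}\n\n"
                f"Claim:\n{claim.description}\n\n"
                "Is this claim covered? Answer 'covered' or 'not covered'."
            )
            answer = self.llm.invoke(prompt)
            decision = "approved" if "not covered" not in answer.lower() else "rejected"

        # 4. Record the decision; on approval, trigger the payment transfer.
        self.database.save_decision(claim.user_id, decision)
        if decision == "approved":
            self.claims_api.initiate_payment(claim.user_id)
        return decision
```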
Given the multiple components of the application, it is crucial to conduct performance testing before deploying it to production. However, the performance testing strategy should be tailored based on whether the Large Language Model (LLM) is used as a Software as a Service (SaaS) or deployed on virtual machines within your cloud subscription.
A. When Large Language Models are called as a service — When utilizing LLMs as a service, avoid making direct calls during performance testing, as this can significantly increase costs. Instead, rely on the benchmarking figures published by the LLM provider (e.g., Google or OpenAI). For performance testing of the rest of the application, call a proxy rather than the actual LLM. The proxy should return dummy outputs, selected randomly from a predefined set, while adhering to the same interface definition. A simple approach is to introduce an environment mode such as 'load testing', in which the orchestrator calls the LLM proxy instead of the actual LLM. Frameworks such as LangChain offer a FakeLLM for this purpose; a code sample can be found at https://js.langchain.com/v0.1/docs/integrations/llms/fake/. Alternatively, set up a proxy server and point the application at it instead of the actual LLM; providers like OpenAI make this easy to configure via the `OPENAI_API_BASE` parameter (see https://www.restack.io/p/openai-python-answer-http-proxy-setup-cat-ai for setup instructions). Finally, log the timings of LLM calls in production; any increased latency should be raised directly with the LLM service provider.
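Below is a minimal sketch of this idea, assuming a hypothetical factory function and an `APP_MODE` environment variable; if your application goes through LangChain or the OpenAI SDK, the same effect can be achieved with their built-in fake LLMs or by pointing `OPENAI_API_BASE` at a local mock server, as noted above.

```python
import os
import random

# Canned outputs that mimic the shape of real LLM responses (illustrative only).
CANNED_RESPONSES = [
    "Covered: the incident falls within the policy period and coverage terms.",
    "Not covered: the policy excludes this category of damage.",
    "Not covered: the claim was filed outside the policy period.",
]

class FakeLLMClient:
    """Stand-in for the real LLM client during load testing."""

    def invoke(self, prompt: str) -> str:
        # Return a random predefined output instead of calling the paid service.
        return random.choice(CANNED_RESPONSES)

class RealLLMClient:
    """Wrapper around the actual LLM service (implementation omitted)."""

    def invoke(self, prompt: str) -> str:
        raise NotImplementedError("Call the real LLM service here.")

def get_llm_client():
    """Factory the orchestrator uses to obtain an LLM client.

    In 'load_testing' mode it transparently returns the proxy, so the
    performance test suite consumes no paid tokens.
    """
    if os.getenv("APP_MODE", "").lower() == "load_testing":
        return FakeLLMClient()
    return RealLLMClient()
```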
B. Large Language Model Deployed on Virtual Machines within Cloud Subscription/Project — First, it is essential to identify what needs to be performance tested: the Large Language Model (LLM) itself or the Generative AI application that calls the LLM as part of its workflow. The Generative AI application comprises multiple components that require performance testing, such as:
- User Interface (UI): Assessing how UI load times behave under concurrent user load.
- Business Logic Code: Evaluating how the business logic handles increased load.
- Auto-Scaling: Testing how the application scales automatically on platforms like Kubernetes under load.
- Database Performance: Monitoring database performance under load.
Performance testing and tuning often go hand in hand, meaning the same suite of performance test cases may need to be executed multiple times. Therefore, it is advisable to conduct separate performance tests for the LLM and the rest of the application suite. LLMs are typically provided by third parties and function as black boxes, limiting the scope for code-level optimizations; the primary tuning options involve horizontal and vertical scaling, such as increasing the number of GPUs. Once the LLM's performance meets the Service Level Objectives (SLOs), you can use a proxy for the LLM during the performance testing and tuning of the other application components.
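As an illustration, here is a minimal load-test sketch using Locust against a hypothetical `/claims` endpoint, assuming the application is running with the LLM proxy enabled (e.g., `APP_MODE=load_testing` as in the earlier sketch); the endpoint path and payload are placeholders for whatever your application exposes.

```python
from locust import HttpUser, task, between

class ClaimUser(HttpUser):
    # Simulated users wait 1-3 seconds between requests.
    wait_time = between(1, 3)

    @task
    def submit_claim(self):
        # Hypothetical claims endpoint; replace with your application's API.
        self.client.post(
            "/claims",
            json={
                "user_id": "load-test-user",
                "description": "Water damage to kitchen flooring after a pipe burst.",
                "policy_document": "dummy-policy-reference",
            },
        )
```

Running this with `locust -f loadtest.py --host http://<your-app>` and ramping up users lets you observe UI/API latency, auto-scaling behaviour, and database performance without incurring any LLM token costs.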