Unlocking the Future with Agentic AI
Generative AI leveraging Large Language Models (LLMs) became a boardroom discussion topic after OpenAI launched ChatGPT towards the end of 2022. The following year brought a lot of euphoria around the capabilities of generative AI powered by LLMs. Everybody jumped on the bandwagon, many proofs of concept (POCs) were built, and a few even made it to production. By late 2023 to mid 2024, it was pretty evident that large language models are good at language tasks but much weaker at logical, reasoning, or mathematical tasks.
Practitioners initially tried to circumvent this limitation with prompt engineering techniques such as chain-of-thought prompting, but only limited success was achieved. For example:
When leading LLMs are prompted with "What is the derivative of x³ with respect to log(x)?", most are able to break the problem into smaller steps and apply the chain rule: first find the derivative of x³ with respect to x, then chain it with the derivative of x with respect to log(x). However, the models made a mistake in one of the steps and gave 3x² as the answer, which is incorrect (the correct answer is 3x³).
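To see where the chain of steps should land, here is a quick check of the same derivative using symbolic math in Python. This is a minimal sketch that assumes the sympy library is available; it is not part of any LLM, it simply verifies the correct answer.

```python
import sympy as sp

x = sp.symbols('x', positive=True)

# Chain rule: d(x^3)/d(log x) = [d(x^3)/dx] / [d(log x)/dx] = 3x^2 / (1/x) = 3x^3
result = sp.diff(x**3, x) / sp.diff(sp.log(x), x)
print(sp.simplify(result))  # prints 3*x**3, i.e. 3x^3, not 3x^2
```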
There are primarily two reasons behind this limitation:
- Training and architecture of LLMs: Large language models follow the transformer architecture and are trained using self-supervised techniques. In this mode, large amounts of data available on the internet (primarily sources such as Wikipedia, news, and digital libraries) are fed to the model in small sequences called context windows, and the model is trained to predict the next word. Such models are also called autoregressive models. Because of this training approach, an LLM is essentially just predicting the next word in a sequence, so no critical thinking is built into it directly during training (a minimal sketch of next-word prediction appears after this list). With a huge corpus of data used to train billions of parameters, LLMs have become very good at recognizing patterns in text, comprehending it, summarizing it, and performing essentially any task that is textual in nature. They lack critical thinking because they were never trained to think critically; they are like students who learned all their reasoning skills and mathematics in literature class. To train for critical thinking, mathematical calculation, or reasoning, LLMs would need to be taught first-principles mathematical concepts such as operators and operands. This poses a serious challenge: it would require a huge amount of data, and it would no longer be simple next-word self-supervised learning but supervised training that performs operations on operands, combines many quantities of different types, and uses the results of those operations as labels. A graph database would be a good fit to hold such data, but generating it remains a challenge.
- Understanding of the human brain: The second reason has more to do with our understanding of the human brain. So far, we have created a mathematical model of a single neuron and connected billions of them in various ways to build AI models (including large language models), training them on data to adjust their weights. Given the training data, any neuron can adjust its weights to learn any pattern. The human brain, however, does not work this way: different parts of the brain are responsible for different functions. For example, the left hemisphere is responsible for quantitative and analytical thinking, whereas the right hemisphere is responsible for language and creativity. Similarly, the prefrontal cortex is responsible for higher-level cognitive abilities that develop into maturity, such as planning, prioritizing, problem solving, and suppressing impulses.
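To make the first point concrete, below is a minimal sketch of what next-word (next-token) prediction looks like in practice, assuming the Hugging Face transformers library and the public GPT-2 checkpoint. The model only ranks candidate continuations of the text; it does not execute any mathematics.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "The derivative of x cubed with respect to log(x) is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)

# The model only scores possible next tokens; it is continuing a text pattern,
# not performing the differentiation.
next_token_id = logits[0, -1].argmax()
print(tokenizer.decode(next_token_id.item()))
```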
Given that both of the above challenges are here to stay for at least a few more years, we need a better mechanism to handle the shortcomings of LLMs, and this is where the agentic framework comes in. The agentic framework mimics how humans perform tasks. Just as humans use memory, additional information or learning, cognitive planning, and finally action to perform a task, AI agents mimic the same behavior. An AI agent not only taps into LLMs but also calls multiple APIs or makes database calls to gather additional information (which is passed as context to the LLM), plans the steps, and calls external APIs to perform actions. The part of the agent responsible for planning, reasoning, cognition, and decision making can be referred to as the brain of the agent. At times, the agentic framework deploys another large language model to act as a reasoning engine and generate code to perform certain mathematical tasks. While the individual pieces of the puzzle (API calls, code, LLMs, and so on) work perfectly well on their own, the key success factor of the agent is its brain, which plans, decides when to call which tool, and creates well-formatted input to pass to that tool.
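A highly simplified sketch of such an agent brain is shown below. Every name here (call_llm, the tool registry, the prompt format) is an illustrative assumption rather than any specific framework's API; the point is only that the planner decides which tool to call, formats its input, and folds the result back into the answer.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a call to whichever LLM API the agent uses."""
    raise NotImplementedError

# Tools the brain can route to: a data lookup stub and a small math sandbox.
TOOLS = {
    "lookup_db": lambda query: f"records matching {query!r}",           # illustrative stub
    "run_math":  lambda expression: eval(expression, {"__builtins__": {}}),
}

def agent(task: str) -> str:
    # 1. Plan: ask the LLM which tool (if any) the task needs and with what input.
    plan = call_llm(
        f"Task: {task}\n"
        f"Available tools: {list(TOOLS)}\n"
        "Reply as '<tool_name> | <tool_input>' or 'answer | <direct answer>'."
    )
    choice, argument = [part.strip() for part in plan.split("|", 1)]

    # 2. Act: either call the chosen tool or trust the LLM's direct answer.
    if choice in TOOLS:
        observation = TOOLS[choice](argument)
        # 3. Respond: feed the tool output back to the LLM as additional context.
        return call_llm(f"Task: {task}\nTool result: {observation}\nFinal answer:")
    return argument
```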
These tasks can indeed be solved by other tools instead of large language models; the trick is to identify when to switch over to a tool and when to trust the LLM to get the correct answer. Using agents as building blocks to create GenAI applications is the right approach to solve this limitation.
We can combine multiple agents to create a multi-agent application. Here the orchestrator (a reasoning agent) calls other agents depending on the task, and each agent completes its task according to the orchestrator's overall plan.
For example, given below is a multi-agent application for creating marketing videos for a software product from its overall product documentation.
- The user asks the application to generate a short video for product 'xyz'.
- Agent 1 calls the LLM to plan how it will generate the short video. The plan involves calling other agents for their individual tasks.
- Agent 1 first calls Agent 2 with the task of generating product feature images with short descriptions.
- Agent 2 takes the product name as input and performs a semantic search on the product catalogue vector DB to retrieve all available information for the product, including images (a minimal retrieval sketch appears after this list).
- Agent 2 calls the LLM to generate a summary of approximately 1,000 words describing the characteristics and value addition of the software.
- Agent 2 calls a multimodal LLM to extract all the images for the product.
- Agent 2 takes the summary along with the extracted screenshots and calls an internal application to interleave the summary sections with the individual screenshots.
- Agent 1 then calls Agent 3 with the information received from Agent 2.
- Agent 3 takes the output of Agent 2 and calls an external text-to-speech API to generate voiceovers.
- Agent 3 then calls a video generation API with the images and voiceovers as input to produce the short video.
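Agent 2's retrieval step above boils down to a semantic search over embedded catalogue entries. In practice a vector database handles the indexing and similarity search; the minimal sketch below only shows the idea, using plain cosine similarity and a placeholder embedding function, both of which are assumptions for illustration.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder for whichever embedding model the catalogue uses."""
    raise NotImplementedError

def semantic_search(query: str, catalogue: list, top_k: int = 5) -> list:
    """Rank catalogue entries (dicts holding text, image references and a
    precomputed 'embedding' vector) by cosine similarity to the query."""
    q = embed(query)
    q = q / np.linalg.norm(q)
    scored = []
    for entry in catalogue:
        v = np.asarray(entry["embedding"], dtype=float)
        scored.append((float(q @ (v / np.linalg.norm(v))), entry))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:top_k]]
```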
The interesting bit here is that if another user does not want a short video but instead wants to share the product summary with an external contact via email, they would simply ask for the summary of product 'xyz' to be emailed. Agent 1 (the planner and orchestrator) would then call only Agent 2 to get the summary document and mail it to the user, without calling Agent 3 at all. In this manner you can create versatile business applications using Agentic AI.
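A compressed sketch of this orchestration pattern is given below. Every agent and tool here is an illustrative stub (the real agents would call LLMs, the vector DB, and the external text-to-speech and video APIs described above); the point is that the orchestrator's plan decides which agents run for a given request.

```python
from typing import Optional

def agent2_summarise_product(product_name: str) -> dict:
    """Stub for Agent 2: retrieve catalogue data, summarise it, extract images."""
    return {"summary": f"summary of {product_name}", "images": []}

def agent3_produce_video(assets: dict) -> str:
    """Stub for Agent 3: text-to-speech plus video generation; returns a file name."""
    return "marketing_video.mp4"

def send_email(recipient: str, body: str) -> None:
    """Stub for an email tool."""
    print(f"emailing {recipient}: {body}")

def orchestrator(request: str, product_name: str, recipient: Optional[str] = None) -> str:
    # Agent 1: plan which agents are needed for this request. In a real system
    # an LLM would produce this plan; here a simple rule stands in for it.
    assets = agent2_summarise_product(product_name)   # needed for both flows
    if "video" in request.lower():
        return agent3_produce_video(assets)           # video flow: Agent 2 then Agent 3
    send_email(recipient or "user@example.com", assets["summary"])
    return "summary emailed"                          # email flow: Agent 2 only

orchestrator("generate a short video", "xyz")
orchestrator("email the product summary", "xyz", recipient="partner@example.com")
```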