Prerequisite — Basic knowledge of Generative AI, Semantic Search, Embeddings and Large Language Models.
Ever since ChatGPT launched at the end of 2022, Large Language Models (LLMs) have come to the fore and grabbed everyone's attention. The possibilities they have demonstrated are phenomenal and have got everyone excited. Thanks to their easy availability, individuals were the first to adopt them and leverage them for their work. In fact, my 10-year-old daughter uses ChatGPT for her homework and intuitively (or based on knowledge from her friend circle) asks it to provide answers suitable for a class 5 student.
However, large organizations have been cautious about adopting them as part of their mainstream enterprise architecture, for the following reasons:
1. Transparency — Large Language Models are a black box, with very little information available about their explainability.
2. Ethical Aspects — As these models are trained on internet-scale data (which is often biased), they suffer from inherent bias across regions, races, genders, etc.
3. Hallucination — Given the vastness of the training data, these models often hallucinate and respond outside the organizational context, so strong guardrails and grounding need to be put in place before rolling them out into production.
So what is the solution? Organizations are trying to develop their own LLMs but are facing challenges in terms of:
1. Data — LLMs need large amounts of curated data. Volume alone is not enough; the variety and veracity of the data are equally important. A good amount of upfront effort is required to prepare trainable data for an LLM.
2. Compute and Cost — LLM training is compute-heavy and very costly. Moreover, there is no upfront guarantee that an organization-specific LLM will outperform the generic LLMs available in the market.
3. Explainability — As LLMs are inherently black boxes, a home-grown model will suffer from the same lack of transparency. And since organizational data can itself be biased, the trained LLM could be biased as well.
4. Sustainability — Training an LLM carries a heavy carbon footprint, which creates a challenge in achieving the sustainability goals the organization has committed to.
While an LLM trained on organizational data will provide answers based on the patterns it captured from that data, it can still mix up those patterns and hallucinate. Moreover, within an organization there are multiple functions (e.g. HR, Legal, Finance, IT, Supply Chain), and creating a single model for the whole organization would mean a sparse model (due to the presence of multiple functional datasets). We would ultimately need to create function-specific LLMs.
On the other hand, generic LLMs try to solve the problem of hallucination through grounding techniques such as Retrieval-Augmented Generation (RAG). Essentially, these techniques require us to provide enough context information as part of the prompt and have the LLM respond only from that context. Given that context windows are only set to increase (Anthropic's Claude 2.1 model has a context window of 200K tokens), this approach seems a reasonable way to take a generic LLM trained on an internet-scale corpus and, by supplying a large enough slice of organizational knowledge as context, focus its generative power on the organization. However, the problem at hand might not always come with this much context, and the approach has an inherent weakness: the embedding vectors are trained across generic data.
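The grounding idea above can be made concrete with a small sketch. The policy text and question below are invented for illustration; the point is simply how a grounded prompt constrains the LLM to the supplied context:

```python
def build_grounded_prompt(question, context_chunks):
    """Assemble a prompt that instructs the LLM to answer strictly from
    the supplied organizational context -- the essence of grounding/RAG."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(context_chunks))
    return (
        "Answer the question using ONLY the context below. "
        "If the answer is not in the context, say you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Hypothetical organizational snippet passed as context.
prompt = build_grounded_prompt(
    "How many days of parental leave do employees get?",
    ["Policy HR-12: Employees are entitled to 20 days of parental leave."],
)
```

The resulting string would then be sent to any LLM API; the instruction to answer only from the context is what keeps the response grounded in organizational knowledge.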
What are Embeddings?
Neural networks (including LLMs) understand only numbers. At the fundamental level, their inputs and outputs are numbers, and they are trained to predict or generate output numbers given input numbers. For these networks to work with words, we need to convert the words into numeric vectors. However, the conversion cannot be random: it needs to preserve the relationships between the words. For example, the distance between 'King' and 'Boy' should be approximately the same as between 'Queen' and 'Girl' in the vector space. These numeric vectors, which encode the relationships between words and sentences, are called embeddings. There are use cases built directly on embedding vectors: for example, if your entire document corpus is stored as embedding vectors and you search for a query, you can match the nearby embeddings (as they are now just numbers) and build a similarity or semantic search application.
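The 'King/Boy vs Queen/Girl' relationship can be sketched with toy vectors. The 3-dimensional values below are invented stand-ins for real embeddings (which have hundreds of learned dimensions); the point is that cosine similarity over the vectors captures the analogy:

```python
import math

# Toy 3-dimensional "embeddings" invented for illustration -- real models
# learn vectors of hundreds of dimensions from large corpora.
EMBEDDINGS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "boy":   [0.2, 0.7, 0.1],
    "girl":  [0.2, 0.1, 0.7],
}

def cosine_similarity(a, b):
    """Standard cosine similarity: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# The King->Boy relationship mirrors Queen->Girl: the two pairs
# come out equally similar, while King/Girl is noticeably less similar.
king_boy = cosine_similarity(EMBEDDINGS["king"], EMBEDDINGS["boy"])
queen_girl = cosine_similarity(EMBEDDINGS["queen"], EMBEDDINGS["girl"])
```

A semantic search application is the same idea at scale: embed every document once, embed the query at search time, and return the documents with the highest cosine similarity.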
There are various techniques to create these embeddings, but the most common ones follow the principle of 'birds of a feather flock together'. That is, if two words frequently appear together across internet-scale documents, they should be represented closely: if 'king' and 'royalty' appear together, they will be close in the vector space. Normally, embeddings are created for complete sentences. Many embedding models accept inputs of around 2,200 words and generate embedding vectors of roughly 700 dimensions. Creating embeddings across data from various industries results in a loss of industry knowledge. For example, across a broad spectrum of data, the following statements will not be closely embedded:
1. SpeakPoint is a web and phone-based ethics concerns reporting tool, operated by an independent service provider, available to employees, external consultants, contractors, agency staff, customers, suppliers, and business partners and those of its affiliates. SpeakPoint is voluntary, confidential, and allows anonymity unless not permitted by a country’s local law.
2. Employees can also use EthicsUp, a third party service provider that allows for anonymous and confidential reporting of unethical or illegal activities. Reporting through EthicsUp can be done via a toll-free hotline or an online reporting platform.
However, if you read both in the context of contracts and the legal domain, both point toward a whistleblower policy. Since generic embedding models (and LLMs) would create very different embeddings for these statements, applications built on top of generic models will find it difficult to analyse industry-specific information. In the example above, if the two clauses come from different companies and we build an application on top of a generic LLM to match the closest clauses, it might fail to identify their similarity.
Generic language models learn patterns from vast and diverse datasets covering a wide range of topics. While powerful, these models may lack the precision required for specialized domains. Domain-specific LLMs, on the other hand, are trained on datasets concentrated within a particular domain, such as finance, healthcare, or law. By focusing on a specific field, these models can capture the nuances, jargon, and context unique to that domain, resulting in more accurate and contextually relevant outputs. This also largely addresses the problem of creating trainable data for an organization, since data already available on the internet can be used and curated. Following are some examples of domain-specific LLMs:
1. Healthcare: Domain-specific LLMs in healthcare can aid in medical document summarization, generate reports, and assist in diagnosing medical conditions. Google has recently launched a healthcare-specific LLM, Med-PaLM 2.
2. Finance: In the financial sector, these models can analyze market trends, generate financial reports, and assist in risk assessment.
3. Legal: For legal professionals, domain-specific LLMs can review contracts, draft legal documents, and provide nuanced legal insights. Their training on legal texts enables them to navigate the intricacies of legal language.
We can use multiple domain-specific models, aligning each to the right use case, and provide further organization-specific information as part of the prompt context. We can still use grounding techniques such as RAG with organizational data. This way we get superior results compared to generic LLMs, yet avoid the challenges of creating organization-specific models.
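The retrieval step of this RAG setup can be sketched in a few lines. The `embed()` function below is a deliberately naive bag-of-words stand-in, and the document snippets are invented; in a real pipeline, `embed()` would call a domain-specific embedding model, and the retrieved snippets would then be passed to the LLM as prompt context:

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding', for illustration only -- a real
    pipeline would call a domain-specific embedding model here."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, documents, k=2):
    """Return the k documents whose embeddings are closest to the query."""
    q = embed(query)
    return sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

# Hypothetical organizational snippets; names and policies are invented.
docs = [
    "SpeakPoint is a confidential ethics reporting tool for employees.",
    "The finance team publishes quarterly risk assessment reports.",
    "EthicsUp allows anonymous reporting of unethical activities.",
]
top = retrieve("Which tool allows anonymous ethics reporting?", docs)
# top contains the two ethics-related snippets, not the finance one.
```

With a legal- or compliance-tuned embedding model in place of the toy `embed()`, clauses such as the SpeakPoint and EthicsUp examples above would land near each other in vector space, and retrieval would surface both for the same query.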
In conclusion, adopting domain-specific language models helps organizations get focused responses aligned to their domain and organization, without having to worry about training an LLM from scratch.