Why LLMs Like ChatGPT Hallucinate with Your Data and What You Can Do About It
There are five core reasons why LLMs "lie" with your private data. Luckily, there's a path to fix them all.
At EyeLevel, we often say GPT and other large language models (LLMs) are Harvard professors of the open Internet and first graders of your private data. They’ve never seen your data before and have a lot of trouble understanding it without significant intervention.
It’s this lack of understanding that creates hallucinations with your data.
LLMs “lie” when they run out of facts and don’t realize it. The problem is greatly exacerbated when you build LLM applications on your private data rather than on the open Internet, where OpenAI and others have already spent considerable energy improving accuracy.
There are several mechanical reasons why this happens and concrete steps you can take to fix them, which I’ll explain below. Importantly, vectors alone can't solve the problems.
One path is to use EyeLevel and our GroundX APIs, the only full stack solution for grounded generation available today, meaning we have the only platform that addresses each of the issues and grounds every LLM response in the facts of your private data.
We built this stack out of our own need for accuracy.
We’ve been building language model AIs for more than 15 years. We pioneered many of IBM Watson’s consumer applications, built the Weather Channel’s forecasting AI that served 2M people a day through Alexa, Siri and Facebook Messenger and have been working alongside OpenAI for almost four years now.
We were developer #20 in OpenAI’s private beta program and have spent the last several years of blood and tears pushing LLMs to provide truthful answers with private data. The EyeLevel platform and our GroundX APIs are the result. We built the tools we needed and realized others could benefit from them too.
Whether you decide to use our tools or build your own, it’s helpful to understand why LLMs fail and what can be done to fix it.
The first issue is a classic GIGO problem (Garbage In, Garbage Out). LLMs need to be fed simple text. Formatting such as tables, headers, tables of contents, columns and many other elements confuses them. Information inside graphics won’t be understood either, and current OCR algorithms mostly don’t solve the problem.
Some of this might be improved by multimodal capabilities that OpenAI and others are promising. That’s the ability to ingest and interpret images, videos, audio and so on. But those systems aren’t in-market at scale yet and the extent of their future capabilities remains unclear.
To handle this issue, we built a parsing engine that transforms documents into simpler formats for LLMs. We have a library of parsers based on our past work and a scripting language called GroundScript that makes it fairly straightforward to build new ones.
When clients show us new document types or we start to work in a new industry like legal or air travel, we build new parsers and our library improves for others who want to use our APIs.
GroundScript is also an open platform. Anyone can write a parser for their own projects or to help others.
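To make the idea concrete, here’s a minimal sketch of the kind of flattening a parser performs. This isn’t GroundScript; it’s a hypothetical Python example that turns a small CSV table into plain sentences an LLM can read without having to guess at rows and columns.

```python
# A minimal sketch of the idea (not GroundScript): flatten a table into plain
# sentences so an LLM sees simple text instead of rows and columns.
import csv
import io

RAW_TABLE = """product,region,q3_revenue
Widget A,EMEA,1.2M
Widget B,APAC,0.8M
"""

def flatten_table(raw_csv: str) -> list[str]:
    """Turn each row into a self-describing sentence."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    sentences = []
    for row in reader:
        parts = [f"{col} is {val}" for col, val in row.items()]
        sentences.append("In this table, " + ", ".join(parts) + ".")
    return sentences

if __name__ == "__main__":
    for sentence in flatten_table(RAW_TABLE):
        print(sentence)
```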
LLMs can only “think” in small blocks of text. So what we (and our competitors) do when ingesting documents or databases is chunk the content into small, paragraph-sized blocks.
However, when you do this you can quickly lose context of what that chunk is about.
Imagine I pull a random paragraph from a book and ask you to tell me what it’s about. You likely won’t know who is talking, what they are talking about or where they are. You won’t know if this is fact or fiction. You won’t even know the content is from a book. Perhaps it’s a medical file, a legal document or a children’s school book.
LLMs are faced with the same dilemma and this context problem is one of the reasons a pure vector approach actually introduces hallucinations into your applications.
To solve this, when we ingest content we run it through a proprietary AI pipeline that classifies, labels and clusters the data as it is chunked.
That means we extract the relevant context of what each chunk is about and wrap the chunk in metadata that describes it: which document it came from, who is speaking, which important issues it references and so on.
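Here’s a simplified sketch of that ingestion step. The real pipeline generates the labels automatically; in this hypothetical example the document-level metadata is supplied by hand just to show the shape of a chunk-plus-wrapper object.

```python
# A simplified, hypothetical sketch of chunking with a metadata wrapper.
# The labels here are hand-supplied; a real pipeline would generate them.
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

def chunk_document(text: str, doc_meta: dict, max_chars: int = 600) -> list[Chunk]:
    """Split on blank lines into paragraph-sized blocks, each wrapped with doc context."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(Chunk(current.strip(), dict(doc_meta)))
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(Chunk(current.strip(), dict(doc_meta)))
    return chunks

document_text = (
    "Either party may terminate this agreement with 30 days written notice.\n\n"
    "All intellectual property created under this agreement belongs to the client."
)
doc_meta = {
    "document": "Master Services Agreement, 2022",
    "type": "legal contract",
    "parties": ["Acme Corp", "Example Inc"],
    "topics": ["termination", "intellectual property"],
}
for chunk in chunk_document(document_text, doc_meta):
    print(chunk.metadata["document"], "->", chunk.text[:50])
```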
Then we store each chunk in our database with its metadata wrapper and corresponding vector embeddings. It's difficult for competitors to do this step because they are all using a type of database called a vector store.
Vector databases are very good for clustering, which is the process of numerically describing how close ideas are to one another and then storing that nearness as a multidimensional number.
For example, the word apple is near the words orange and fruit in one dimension and near computer in another. It is somewhat near steak (another kind of food) and very far from car, rainbow and machine gun.
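If you want to see what “nearness” looks like numerically, here’s a toy example with made-up three-dimensional vectors. Real embeddings come from a model and have hundreds or thousands of dimensions, but the cosine-similarity math is the same.

```python
# Toy illustration of "nearness" using invented 3-dimensional vectors.
# Real systems use model-generated embeddings with far more dimensions.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

vectors = {
    "apple":    [0.90, 0.80, 0.10],  # invented axes: fruit-ness, food-ness, machine-ness
    "orange":   [0.95, 0.70, 0.00],
    "steak":    [0.10, 0.90, 0.00],
    "computer": [0.00, 0.00, 0.95],
    "car":      [0.00, 0.05, 0.60],
}

for word in ["orange", "steak", "computer", "car"]:
    print(f"apple vs {word}: {cosine(vectors['apple'], vectors[word]):.2f}")
```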
This is one of the important methods LLMs like GPT use to understand words and phrases. So it's natural that most developers would use vector databases to store new corporate data. It’s the first thing we tried three years ago. In fact it's what OpenAI and Microsoft tell you to do.
But what they don’t tell you is vector stores can’t hold the metadata that the text chunks need for context later on. Vectors are important, but insufficient to solve the problem.
We use a traditional SQL database structure instead, which allows us to store both vectors and metadata and to rapidly search both when answering user questions.
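As a rough illustration of the storage idea (this is not EyeLevel’s actual schema), a single SQL row can hold the chunk text, its metadata wrapper and its embedding side by side, and a brute-force similarity search can run across all of them:

```python
# A minimal SQLite sketch (not EyeLevel's actual schema) showing that one row
# can hold the chunk text, its metadata wrapper, and its embedding together.
import json, sqlite3, struct

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    text TEXT NOT NULL,
    metadata TEXT NOT NULL,      -- JSON wrapper: document, speaker, topics, ...
    embedding BLOB NOT NULL      -- packed floats from any embedding model
)""")

def pack(vec):
    """Store a float vector as raw bytes."""
    return struct.pack(f"{len(vec)}f", *vec)

def unpack(blob):
    return struct.unpack(f"{len(blob) // 4}f", blob)

conn.execute("INSERT INTO chunks (text, metadata, embedding) VALUES (?, ?, ?)",
             ("Either party may terminate with 30 days written notice.",
              json.dumps({"document": "Acme MSA", "section": "Termination"}),
              pack([0.12, 0.88, 0.33])))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5))

def search(query_vec, top_k=3):
    """Brute-force similarity search; a production system would index this."""
    rows = conn.execute("SELECT text, metadata, embedding FROM chunks").fetchall()
    scored = [(cosine(query_vec, unpack(e)), t, json.loads(m)) for t, m, e in rows]
    return sorted(scored, key=lambda s: s[0], reverse=True)[:top_k]

print(search([0.10, 0.90, 0.30]))
```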
A second advantage of this approach is that it allows for true abstraction of the LLM from your application. Want to switch LLMs? There’s no need to rebuild your embeddings or re-ingest your data. With EyeLevel, you can switch LLMs instantly with a drop-down box or an API call.
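Conceptually, that abstraction can be as simple as hiding the model behind one interface and choosing the concrete provider by configuration. The class and model names below are hypothetical stubs, not EyeLevel’s API:

```python
# Hypothetical sketch of LLM abstraction: the app calls one interface and the
# concrete model is chosen by configuration, so switching providers does not
# require re-embedding or re-ingesting your data.
from typing import Protocol

class CompletionModel(Protocol):
    def complete(self, prompt: str) -> str: ...

class OpenAIModel:
    def __init__(self, model_name: str):
        self.model_name = model_name
    def complete(self, prompt: str) -> str:
        # call the OpenAI API here; stubbed for the sketch
        return f"[{self.model_name}] answer to: {prompt[:40]}..."

class AnthropicModel:
    def __init__(self, model_name: str):
        self.model_name = model_name
    def complete(self, prompt: str) -> str:
        # call the Anthropic API here; stubbed for the sketch
        return f"[{self.model_name}] answer to: {prompt[:40]}..."

MODELS = {
    "gpt-3.5": lambda: OpenAIModel("gpt-3.5-turbo"),
    "claude": lambda: AnthropicModel("claude-2"),
}

def get_model(name: str) -> CompletionModel:
    return MODELS[name]()

llm = get_model("gpt-3.5")   # swap to "claude" without touching ingestion
print(llm.complete("Summarize the termination clause."))
```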
Even after these steps, LLMs sometimes don’t say what you think they should. They aren’t perfect out of the box. To improve certain responses it's valuable to have a human in the loop, especially as you’re training on new data.
OpenAI built a massive capacity for this. Human refinement, or Reinforcement Learning from Human Feedback (RLHF) as it’s formally called, is a large part of why GPT-4 is so much better than GPT-3.
That’s great for OpenAI, but not helpful to companies ingesting new data. They need their own human refinement tools.
To solve this problem, we built a human refinement tool that lets you fire hundreds of questions at the application, audit every question/answer pair and, with a single click, edit the content the LLM is using to respond if it’s not quite right.
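If you’re rolling your own, a bare-bones version of that workflow looks something like the hypothetical sketch below: fire a batch of test questions at your pipeline, record every question/answer pair along with the chunks used, and queue it all up for a human to audit and correct.

```python
# A bare-bones sketch (not EyeLevel's refinement tool): run a batch of test
# questions through your pipeline and write the Q/A pairs plus source chunks
# to a CSV for a human reviewer to audit and correct.
import csv

def answer_question(question: str) -> tuple[str, list[str]]:
    # placeholder for your retrieval + completion pipeline
    return ("stub answer", ["stub source chunk"])

test_questions = [
    "What is the notice period for termination?",
    "Who owns the intellectual property created under this agreement?",
]

with open("audit_queue.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["question", "answer", "source_chunks", "reviewer_notes"])
    for q in test_questions:
        answer, chunks = answer_question(q)
        writer.writerow([q, answer, " | ".join(chunks), ""])
```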
If you do all of the steps above, your responses are going to improve dramatically. But mistakes can still sneak through and a human can’t be there 100% of the time to refine every response.
We believe what’s needed is a final check on responses that come from an LLM. In our system, we score every LLM response for fidelity to the private data you have fed it. If the response doesn’t meet a threshold for accuracy we block it. And for good measure, when we ask an LLM for a completion, we actually fire it twice and score both responses.
To be clear, what we’re scoring here isn’t the “Truth” with a capital T. Our system doesn’t know what’s true in the world.
But it does know what data the LLM should be using to answer a question.
That’s because when a user asks a question, we use our AI pipeline to understand the question, rapidly search our database of private content for the text blocks most likely to contain the answer, then send the question, the answer blocks and other prompts to the LLM. We instruct it to answer the question but only use the answer blocks we have sent it for the answer.
So when the answer comes back, if it contains names, places or ideas that aren’t in the answer blocks we sent, we have a very good sense that a hallucination has taken place and we can block that response.
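Here’s a greatly simplified stand-in for that kind of grounding check. Our actual scoring is more involved, but the shape is the same: compare the terms in the response against the source blocks and block anything that falls below a threshold.

```python
# A greatly simplified stand-in for a grounding check: flag a response when it
# introduces terms that never appear in the source blocks it was given.
import re

STOPWORDS = {"the", "a", "an", "is", "are", "was", "were", "of", "to", "in",
             "and", "or", "for", "on", "with", "that", "this", "it", "as", "by"}

def grounding_score(response: str, source_blocks: list[str]) -> float:
    """Fraction of non-trivial response terms that appear in the source blocks."""
    source_terms = set(re.findall(r"[a-z0-9]+", " ".join(source_blocks).lower()))
    response_terms = [t for t in re.findall(r"[a-z0-9]+", response.lower())
                      if t not in STOPWORDS]
    if not response_terms:
        return 1.0
    supported = sum(1 for t in response_terms if t in source_terms)
    return supported / len(response_terms)

blocks = ["Either party may terminate the agreement with 30 days written notice."]
good = "Yes, either party may terminate the agreement with 30 days written notice."
bad = "Termination requires approval from the Delaware arbitration board."

for resp in (good, bad):
    score = grounding_score(resp, blocks)
    verdict = "pass" if score >= 0.8 else "block"
    print(f"{score:.2f} -> {verdict}: {resp}")
```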
The last piece of the puzzle is making sure the LLM has the proper context for the current question.
Human conversation is a back and forth affair. Your fifth question in a dialogue might reference information gleaned from the first four.
The natural way to handle this is to make sure when sending question number five to the LLM, you also include questions and answers from the first four interactions.
Sounds simple, but there’s a catch.
LLMs impose limits on how much data you can send with each request. This is called a token window or context window.
A token is sometimes a full word, but often it’s just a piece of a word. Roughly speaking, you need about a third more tokens than words: 75 words typically take around 100 tokens.
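If you want to check the ratio on your own text, OpenAI’s tiktoken library counts the actual tokens a given model will see. The sample sentence below is just an illustration; the ratio varies with the text.

```python
# Counting tokens with OpenAI's tiktoken library (pip install tiktoken) to see
# how the word-to-token ratio plays out on real text.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
text = ("Whether we're talking about contracts, support tickets, or call "
        "transcripts, every word you send to the model costs tokens.")

tokens = enc.encode(text)
words = text.split()
print(f"{len(words)} words -> {len(tokens)} tokens "
      f"({len(tokens) / len(words):.2f} tokens per word)")
```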
Ok. So in the real world what does that mean?
OpenAI’s GPT-3.5 model had a 4K window, which has just been expanded to 16K. Their GPT-4 model can handle 32K tokens per request but is prohibitively expensive for most tasks.
Anthropic, an OpenAI competitor, now allows 100K tokens in a single request.
Whether we’re talking 4K tokens or 100K, that all sounds like a lot of words to ask and answer a question.
But let’s remember the context problem.
To answer that fifth question, we need to send:
- the new question itself,
- the previous four questions and their answers,
- the retrieved text blocks most likely to contain the answer,
- and the instructions that tell the LLM how to use them.
You can see how quickly you can run out of token space, especially if you’re using the most common LLM in production today, GPT-3.5.
What’s needed is some form of memory management that can make judgments on how much room to leave for each of those pieces and how to compress it into the token window you’re dealing with.
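A rough illustration of that kind of budgeting (not EyeLevel’s implementation): reserve room for the new question, the instructions and the retrieved blocks, then fill whatever is left with the most recent conversation turns.

```python
# A rough illustration (not EyeLevel's implementation) of token-window
# budgeting: reserve space for the new question, instructions, and retrieved
# blocks, then fill what's left with the most recent conversation turns.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)   # crude heuristic; use a real tokenizer in practice

def build_context(question, instructions, retrieved_blocks, history,
                  window=4000, reply_reserve=500):
    budget = window - reply_reserve
    budget -= estimate_tokens(question) + estimate_tokens(instructions)
    budget -= sum(estimate_tokens(b) for b in retrieved_blocks)

    kept_history = []
    for turn in reversed(history):        # newest turns first
        cost = estimate_tokens(turn)
        if cost > budget:
            break
        kept_history.insert(0, turn)      # keep chronological order
        budget -= cost

    return {"instructions": instructions, "history": kept_history,
            "blocks": retrieved_blocks, "question": question}

history = [f"Q{i}: ...\nA{i}: ..." for i in range(1, 5)]
ctx = build_context("Q5: What did the amendment change?",
                    "Answer only from the blocks provided.",
                    ["Block about the 2021 amendment..."], history)
print(len(ctx["history"]), "prior turns kept")
```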
EyeLevel’s APIs do this out of the box, with no extra configuration or coding required.
Some speculate that this problem will go away as token windows grow. We’re doubtful. The history of computing is that capacity expands and computing demands expand with it. The other aspect is cost: LLM providers charge per token, so they’re happy when you fill up whatever token window they give you. You might not be as happy when you get the bill. So we think token window optimization will be here for some time.
What we’ve found is controlling hallucinations and generally pushing LLMs to provide more accurate responses requires the stacking of several techniques on top of each other as shown in the chart below.
Many of the most popular tools like Pinecone (vector database) and LangChain (app orchestration) solve just one or two pieces. And unfortunately OpenAI, Microsoft and other LLM providers aren’t giving developers the full picture on what’s needed to truly be successful building LLM-powered applications.
EyeLevel and our GroundX APIs provide a one-stop stack for grounded generation, saving developers thousands of hours of blood, sweat and LLM tears.
Join the AI revolution. Save time, money and resources with the first GPT powered bot built for your business.