How to train ChatGPT on your own text (Chat with your own data, train a text AI to generate content about your docs, book, website, etc)
Create your own text AI
Hello! Josh from Mythical AI here. I had pineapple on my pizza twice this week and loved it. 🍍🍕 This week, we learn how to use your own text and data to make text AIs way more useful!
Large language models (LLMs) are so hot right now.
Using large language models, AI systems can summarize articles, write stories and engage in conversations.
A large language model, or LLM, is a deep learning algorithm that can recognize, summarize, translate, predict, and generate text and other content based on its knowledge of the relationships between words, gained from training on huge datasets.
(If you don’t know what deep learning or training means, check out this crash course of AI terms article)
To train an LLM, large amounts of text are given to the AI algorithm using unsupervised learning. The large language model learns words, as well as the relationships between them and the concepts behind them. For example, it could learn the difference between a dog's bark and tree bark based on context, meaning the words and concepts that surround it. If the word 'tree' is in the same paragraph as the word 'bark', the LLM learns that it means plant covering, not dog sound.
By learning words and their relationships, the LLM can guess what might come next in a sentence or paragraph. The LLM uses this to predict and generate content.
ChatGPT is the most famous and viral of the LLMs, and it is impressive what ChatGPT can do. For example, you can ask ChatGPT to tell you the 3 main ideas of the Declaration of Independence and it does a pretty solid job.
ChatGPT has this ability because its training dataset included the Declaration of Independence. ChatGPT is based on OpenAI’s GPT3 LLM.
There are lots of large language models. I will write about them in a future episode, but for today we are going to focus on GPT3 because it is the most widely used, with the best API currently.
“Hey ChatGPT, let’s talk about the book I wrote.”
But what if you want ChatGPT to tell you the 3 main ideas of something that is not in its training dataset? For example, you want 3 main ideas from your most recent newsletter edition? Currently, ChatGPT can’t do this, because it doesn’t know about your newsletter. Since your newsletter wasn’t in the training dataset, it doesn’t exist.
New AI applications that can pull data from the internet might be able to do this more easily in the future. Bing (still in closed beta), for example, can 'google' concepts and use them. In a fun example on Twitter, Ethan Mollick asked Bing to re-write a story using Kurt Vonnegut's Rules of Writing, which it did after looking them up.
Bing is not widely available. But luckily for us, there are four techniques we can use today to get text generator models to use our own text and information.
The four methods are:
Give the AI the data in the prompt
Fine-tune a GPT3 model
Use a paid service
Use an embedding database to feed the needed data into the prompt
Let’s dive into all four ways to use your data with GPT3.
1. Give the AI the data in the prompt
The easiest way to have the AI interact with your data is to give it the data it needs in the prompt.
You can just hand GPT3 a chunk of text and then ask it to interact with it. This works relatively well.
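In code, this is as simple as it sounds. Here is a minimal sketch using the openai Python library (the pre-1.0 SDK that was current when this was written); newsletter.txt stands in for whatever text you want the AI to work with:

import os
import openai  # pre-1.0 OpenAI Python SDK

openai.api_key = os.environ["OPENAI_API_KEY"]

# Placeholder: a chunk of your own text, e.g. a newsletter edition
my_text = open("newsletter.txt").read()

# Stuff your text into the prompt, then ask a question about it
prompt = f"{my_text}\n\nWhat are the 3 main ideas of the text above?"

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    max_tokens=256,
    temperature=0.2,
)
print(response["choices"][0]["text"])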
The issue with this method is there is a limit to the amount of text you can give to GPT3.
The limit is there because running AI models is computationally expensive. If you handed GPT3 1,000 pages of text in the prompt, GPT3 would need to go through each word to understand what it is, what it means, and its relationship to other words.
OpenAI has set the prompt input limit at 4,000 tokens, or approximately 3,000 words. What is a token? From this site by OpenAI:
The GPT family of models process text using tokens, which are common sequences of characters found in text. The models understand the statistical relationships between these tokens, and excel at producing the next token in a sequence of tokens.
You can use the tool below to understand how a piece of text would be tokenized by the API, and the total count of tokens in that piece of text.
If you try to submit something that is too long, you will get an error telling you that you have exceeded the model's maximum context length.
A helpful rule of thumb is that one token generally corresponds to ~4 characters of text for common English text. This translates to one token for roughly ¾ of a word, so 100 tokens ≈ 75 words. If you are not sure how much you can submit, paste it into the little tool on the OpenAI website and it will tell you how many tokens your text is.
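You can also count tokens in code with OpenAI's tiktoken library instead of pasting into the web tool. A quick sketch (newsletter.txt is a placeholder file):

import tiktoken

# Grab the tokenizer that matches the model you plan to call
enc = tiktoken.encoding_for_model("text-davinci-003")

text = open("newsletter.txt").read()  # placeholder file
print(len(enc.encode(text)), "tokens")  # needs to stay under the ~4,000 token limit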
But what if you want to have the AI use more than 3,000 words?
We need to step up our game. We need to fine-tune GPT3.
2. Fine-tune GPT3 using your own text
Fine-tuning is where you take an existing model, and then add in your own data on top. We take OpenAI’s base model GPT3, and train a new model on a curated dataset that we supply.
This is great because we don’t have to feed billions of words and buy hundreds of GPUs to train our own model from scratch. We just teach GPT3 about our text.
Even better, OpenAI has an official API that lets you do this. All you need to do is run one command pointing at your text, and it will fine-tune a GPT3 model for you, which you can then use in the playground or via the API.
Unfortunately, it is not as easy (yet) as handing GPT3 all your text.
You have to give it text in a structured format, essentially telling it how it should try to generate text, based on your data and inputs.
You need to create a bunch of examples of what the prompt would be, and then what the answer should be. It looks something like this:
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
After you have all these prompt-completion examples, you package them up in a JSONL file (one example per line) and submit it to the OpenAI API.
This requires some knowledge of programming (ChatGPT can help you figure this out as well) but this article walks you through it.
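For the curious, the submission itself is only a few lines with the (pre-1.0) openai Python library; data.jsonl here stands in for your file of prompt-completion pairs:

import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

# Upload the prompt-completion pairs (one JSON object per line)
training_file = openai.File.create(
    file=open("data.jsonl", "rb"),
    purpose="fine-tune",
)

# Kick off a fine-tune on top of the base Davinci model
job = openai.FineTune.create(
    training_file=training_file["id"],
    model="davinci",
)
print(job["id"])  # when the job finishes, the new model appears in the playground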
A neat trick: you can also give some of your text to ChatGPT and ask it to create the prompt-completion pairs for you.
For example, I handed ChatGPT the entire article about creating consistent characters in Stable Diffusion from last week, and asked it to create questions and answers from it.
It did a pretty good job.
I would need to fix, refine, and correct a lot of the output, but it is a good starting point.
Once your model is done being fine-tuned, you can test it out in the OpenAI playground by selecting the model dropdown and picking the model you created and named.
How much does it cost to fine-tune GPT3 using your own text? 💵
Because processing all of your text costs money, they charge you for this, but they give you a discount. Fine-tuning a model is charged at 50% of the cost of running the model.
The most powerful and expensive GPT3 model, Davinci, costs $0.0200, or 2 cents, per 1,000 tokens.
The entire Harry Potter series has 1,084,170 words. At roughly ¾ of a word per token, that is about 1,084,170 ÷ 0.75 ≈ 1,445,560 tokens. So fine-tuning Davinci on just the words in the entire series, one time, would cost about 1,445,560 tokens ÷ 1,000 = 1,445.6 thousand-token bundles × $0.02 = $28.91.
Update March 2023 - OpenAI cut the price by 10x.
However, by default your tokens are used for training 4 times (each pass is called an epoch). So multiply your number of training tokens by 4 to get a more accurate price.
To get good results, you would not be able to just send the raw text. You would need to create prompts paired with the ideal generated text, which would inflate the token count even further.
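Putting the whole back-of-the-envelope estimate together (word count and prices from above, 4 epochs by default):

# Rough fine-tuning cost estimate, using the pre-price-cut numbers above
words = 1_084_170        # the entire Harry Potter series
tokens = words / 0.75    # one token per ~3/4 word -> ~1.45M tokens
epochs = 4               # OpenAI's default number of training passes
price_per_1k = 0.02      # Davinci fine-tuning, dollars per 1,000 tokens

cost = tokens * epochs / 1000 * price_per_1k
print(f"${cost:,.2f}")   # about $115.64, before the March 2023 price cut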
If all of this seems difficult, you are correct. That is why a bunch of companies have sprouted up that automate this. This leads us to our next method of training GPT on your own text.
3. Use a paid service
There are a number of services that let you give them text content, which they will then use to generate a GPT-powered chatbot for you.
I haven’t used any of these services but they all seem like they would work.
https://www.filechat.io/ - Upload your PDF and start asking questions to your personalized chatbot.
https://www.chatbase.co/ - Upload a PDF and ask a GPT-based chatbot to answer questions on it.
https://www.docuchat.io/ - You upload documents. They feed your content to your chatbot.
https://www.humata.ai/ - An AI research assistant: ask questions about any file (i.e. technical paper, report, etc.) in English and automatically get the answer. Humata is like ChatGPT for your files.
https://www.customgpt.ai/ - Ingests your website and creates a chatbot from the content.
Since they are not open source we don’t know exactly what each service is doing, but odds are, they are using the last method, which is ‘use an embedding database to feed the needed data into the prompt’. This is the most difficult method of training a chatbot on your own data but scales the best. Let’s dig into it.
4. Use an embedding database to feed the needed data into the prompt
If you don’t know what embeddings are, please read this section of the article and then come back.
Fine-tuning GPT can be super effective if you have a defined use case. But if you are not sure what users will be asking for, it can be effective to put all your content into an embedding database and then pull out related data and put that into the prompt.
For example, let’s say you wrote a book about places to eat in New York. You could take that content and put it into an embedding database.
Just like latitude and longitude can help you tell how close two cities are on a map, embeddings do the same kind of thing for text chunks. If you want to know if two pieces of text are similar, you just calculate the embeddings for them and compare them. Text chunks with embeddings that are “closer” together are similar.
This would store chunks of your book in a map of space, putting chunks that are more related (for example, phrases about pizza and phrases about Italian food) closer together, and chunks that are less related (for example, phrases about pizza and phrases about soup) further apart.
Next, you accept the user input and embed it as well. This is super useful because then you know how similar the phrase is to your content.
If someone typed in a prompt that mentions spaghetti, they might not have mentioned ‘Italian food’ specifically. But since you have your content and the user phrase embedded, you can see how similar they are. ‘Italian food’ and ‘spaghetti’ will probably be close so you can take the sections of your book that talk about Italian food, put them into the prompt, then return the output to the user.
Here is an example of how that could work.
The user asks ‘What is the best spaghetti in New York?’
You take the phrase 'What is the best spaghetti in New York?' and embed it alongside your content.
A passage you wrote about Fiaschetteria Pistoia: 'The petite Tuscan Fiaschetteria Pistoia is one of the best Italian restaurants in the city. It makes up for its size with charm: the servers are brusque yet friendly, and everything they make is a celebration of the genre.'
This passage relates very closely to the user's question. You know this because the two are placed close together in the embedding database. This is important because Fiaschetteria Pistoia has great spaghetti, but your text never mentioned spaghetti directly. Because we know how close those two chunks of text are, we can use that info.
So you take the user prompt and your content about Fiaschetteria Pistoia, put them together into a prompt, and send that to GPT.
This helps GPT know your content, and the approach is very flexible.
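Here is a minimal sketch of that retrieve-then-prompt flow, again with the pre-1.0 openai SDK. A real implementation would keep the embeddings in a vector database rather than a Python list, and the book chunks here are made up:

import os
import numpy as np
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

# Hypothetical chunks from your book about places to eat in New York
chunks = [
    "The petite Tuscan Fiaschetteria Pistoia is one of the best Italian restaurants...",
    "For soup, nothing beats the hand-pulled noodle shops on Eldridge Street...",
]

def embed(texts):
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return [np.array(d["embedding"]) for d in resp["data"]]

chunk_vectors = embed(chunks)  # in production, store these in a vector DB

question = "What is the best spaghetti in New York?"
q_vector = embed([question])[0]

# ada-002 vectors are unit length, so a dot product is cosine similarity
scores = [float(q_vector @ v) for v in chunk_vectors]
best_chunk = chunks[int(np.argmax(scores))]

# Stuff the most similar chunk into the prompt and send it to GPT
prompt = f"Context: {best_chunk}\n\nQuestion: {question}\nAnswer:"
answer = openai.Completion.create(
    model="text-davinci-003", prompt=prompt, max_tokens=200
)
print(answer["choices"][0]["text"])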
Dan Shipper has a great article where he explains exactly how he did this with a Substack newsletter here.
While more complicated, using embeddings to stuff prompts can drastically improve the results you get.
You can of course use embeddings along with a fine-tuned GPT3 model for hopefully even more accuracy and value.
Update Feb 2023 - this is why open source is cool. Someone made a library that will do this basically for you. The tweet is inaccurate (it is embedding, then feeding relevant info into the prompt, not fine-tuning), but it is still much easier to do now.
Update March 2023 - It gets even easier. PDF to Ask my Book site in 60 seconds.
https://www.steamship.com/build/ask-my-book-site
Scenarios or use cases for training GPT3 on your own data
We covered 4 different methods to train GPT3 on your own text. Here are some interesting use cases.
Personalized email generator
Prepare a dataset from the emails you have sent, and then fine-tune a Davinci model on this dataset. You will now have a personalized email generator that follows your style and writes emails for you.
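A couple of hypothetical training pairs for this (the emails are made up) might look like:

{"prompt": "Reply to an email asking to reschedule Tuesday's meeting ->", "completion": " Hey! No worries at all, Thursday works great for me. Talk soon! -Josh"}
{"prompt": "Reply to a reader asking where to start with AI art ->", "completion": " Great question! I'd start with the free tools, then..."}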
A chatbot that talks in the style of someone famous
Chat with a famous author, like Isaac Asimov or Carl Sagan. Prepare a dataset of their books or words and then fine-tune GPT3.
Blog posts
Write blog posts in your style, with your information.
Chat with an author or a book
Sahil wrote a book and made a site where you could ask that book questions.
Customer Service
Train GPT3 on your help docs and style so customers can get quick answers.
What did I miss? Are there any other methods or use cases you find compelling? Let me know in the comments.
See you next week!
-Josh
Philosopher here! Can I train with every known work of two philosophers and then simulate a conversation between the two, including on topics that they had significant but very subtle philosophical disagreement on, and where both use technical terms in slightly different ways? How does one train humor?
Hi. I think many people have personal journals and would like to create AI versions of themselves. It can be therapeutic to talk to a past version of yourself built on diary entries from 15 years ago.
Could fine tuning allow for this? An old diary doesn't seem to lend itself to "prompt" "answer" format.