From current events to proprietary data: How to train GPT-3 for your business needs
How do you get real business value out of ChatGPT?
With all the hype there’s been around OpenAI’s groundbreaking technology, that might sound like an odd question—isn’t a powerful, easy-to-use language model obviously going to generate value? The fact that it’s reached 100 million users faster than any digital application in history certainly speaks to its widespread appeal. But as we tinkered with it, some critical limitations became clear.
One of our defining principles at Smart Design is a focus on user experience and user value, regardless of the technology that enables it. This is why a powerful new technology isn’t enough on its own. To deliver value, it has to address a human need or drive an improved experience.
ChatGPT, for all its promise and appeal, is still a very generic tool. The GPT-3 platform it’s built on is trained on a huge but unfocused data set. For the average business or user, this makes it useful only up to a point—like a personal assistant who’s world-class at looking things up, but knows nothing about you, the specific challenges you face, or even the recent events affecting you.
Any language model is trainable though, and a chat-based assistant that actually knows your business, your industry, your company, or you personally would be game-changing in a way that ChatGPT currently isn’t. That suggests an opportunity: to customize GPT-3 as a platform, using OpenAI’s APIs.
What would that involve, we wondered? How much training data would it take? How much would it cost and how hard would it be?
So we did what any good developer would do, and started running experiments. An obvious place to start is current events, since the latest version (GPT-3.5) is trained only on data through June 2021. Our hypothesis was that we could train GPT-3 on RSS feeds from major news sites, then have it answer questions about recent events.
OpenAI offers a service called fine-tuning, which lets you customize a model by feeding it example prompts and responses that exemplify what you want it to learn. This was our first approach in running this experiment.
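Fine-tuning data takes the form of a JSONL file: one JSON object per line, each with a prompt and its desired completion. A minimal sketch of building such a file, using made-up toy pairs of our own (not real training data):

```python
import json

# Toy prompt/completion pairs -- purely illustrative, not real data.
# OpenAI's guidance suggests completions start with a leading space.
examples = [
    {"prompt": "Company founded year", "completion": " Acme Corp was founded in 1999."},
    {"prompt": "Company headquarters", "completion": " Acme Corp is headquartered in Toledo, Ohio."},
]

# JSONL: serialize one object per line
with open("toy-trainingdata.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

# Read the file back to confirm the format round-trips
with open("toy-trainingdata.jsonl") as f:
    lines = [json.loads(line) for line in f]
print(len(lines))  # 2
```

The same pattern scales to however many pairs you have; each line is an independent training example.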
Explore our experiments with GPT on GitHub
Fine-tuning with OpenAI’s GPT model
The first thing we did was install OpenAI's Python package, then pick a topic that required recent information: the 2023 train derailment in Ohio. As a baseline, we asked the stock model about it directly.
pip install --upgrade openai
import os
import openai

os.environ['OPENAI_API_KEY'] = "Add OpenAI key here"
openai.api_key = os.environ['OPENAI_API_KEY']
prompt = "Where did the train carrying hazardous materials derail?"
result = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt
)
print(result["choices"][0]["text"])

The exact location of the train derailment is not available, as different
Our first training attempt was to fine-tune the model by adding specific data about the train derailment in early 2023. This required preparing the data, saving it to a file, and uploading it to OpenAI, as follows:
# from https://en.wikipedia.org/wiki/2023_Ohio_train_derailment
examples = [
{"prompt": "2023 Ohio train derailment", "completion": "The 2023 Ohio train derailment (also called the East Palestine train derailment) occurred on February 3, 2023, at 8:55 p.m. EST (UTC−5), when a Norfolk Southern freight train carrying hazardous materials derailed in East Palestine, Ohio, United States.[1] The freight train burned for more than two days, and then emergency crews conducted a controlled burn of several railcars at the request of state officials,[2] which released hydrogen chloride and phosgene into the air.[1] As a result, residents within a 1-mile (1.6-kilometer) radius were evacuated, and an emergency response was initiated from agencies in Ohio, Pennsylvania, and West Virginia. The U.S. federal government sent Environmental Protection Agency (EPA) administrator Michael S. Regan to provide assistance on February 16, 2023."} ]
import json

with open("trainingdata.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
file = openai.File.create(file=open("trainingdata.jsonl"), purpose='fine-tune')
From here, we instructed OpenAI to begin fine-tuning a model using DaVinci as a base model, but including the additional information about the 2023 train derailment in Ohio.
fine_tune = openai.FineTune.create(training_file=file['id'], model="davinci")
# from the shell, substituting the fine-tune job ID:
openai api fine_tunes.follow -i {fine_tune['id']}
result = openai.Completion.create(
    model="davinci:ft-personal-2023-02-16-20-32-47",
    prompt=prompt
)
print(result["choices"][0]["text"])

Officials say the train derailed in Nantes Dorian, just west of
Fine-tuning using more data from RSS feeds
For our second experiment, we decided to fine-tune the model on recent news, then ask it about a current event. We began by installing an RSS parser, downloaded the recent headlines and summaries from several major news outlets via their RSS feeds, and used those to fine-tune the model.
pip install rss-parser
from rss_parser import Parser
from requests import get

rss_urls = [
"https://rss.nytimes.com/services/xml/rss/nyt/US.xml",
"https://rss.nytimes.com/services/xml/rss/nyt/World.xml",
"http://feeds.bbci.co.uk/news/rss.xml?edition=us",
"http://rss.cnn.com/rss/cnn_world.rss",
"http://rss.cnn.com/rss/cnn_us.rss",
"https://feeds.washingtonpost.com/rss/world?itid=lk_inline_manual_36",
"https://feeds.washingtonpost.com/rss/national?itid=lk_inline_manual_32",
"https://feeds.a.dj.com/rss/RSSWorldNews.xml",
"https://feeds.a.dj.com/rss/WSJcomUSBusiness.xml",
"https://news.google.com/rss?hl=en-US&gl=US&ceid=US:en"
]

prompts = []
for url in rss_urls:
    xml = get(url)
    parser = Parser(xml=xml.content)
    feed = parser.parse()
    for item in feed.feed:
        prompts.append({"prompt": item.title, "completion": item.description})

with open("rss-trainingdata.jsonl", "w") as f:
    for prompt in prompts:
        f.write(json.dumps(prompt) + "\n")
openai tools fine_tunes.prepare_data -f rss-trainingdata.jsonl -q
file = openai.File.create(file=open("rss-trainingdata_prepared.jsonl"), purpose='fine-tune')
fine_tune = openai.FineTune.create(training_file=file['id'], model="davinci")
# from the shell, substituting the fine-tune job ID:
openai api fine_tunes.follow -i {fine_tune['id']}
prompt = "Where did the train carrying hazardous materials derail?"
result = openai.Completion.create(
    model="davinci",
    prompt=prompt + '\n\n###\n\n'
)
print("Before (non-finetuned) result: " + result['choices'][0]['text'])
result = openai.Completion.create(
    model="davinci:ft-personal-2023-02-16-21-29-25",
    prompt=prompt + '\n\n###\n\n'
)
print("After (finetuned) result: " + result['choices'][0]['text'])

Before (non-finetuned) result:
Additional Information:
Sound Transit’s emergency closure of
After (finetuned) result:
Backgrounder
In the early hours of February 10, 2019
Getting customized results without fine-tuning
For this next experiment, we tried something that seems counterintuitive: we posed a question to GPT-3 and provided the answer to the question as a pre-condition.
prompt = "Given that The 2023 Ohio train derailment (also called the East Palestine train derailment) occurred on February 3, 2023, at 8:55 p.m. EST (UTC−5), when a Norfolk Southern freight train carrying hazardous materials derailed in East Palestine, Ohio, United States. Where did the train carrying hazardous materials derail?"
result = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt + '\n\n###\n\n'
)
print(result['choices'][0]['text'])
The train carrying hazardous materials derailed in East Palestine, Ohio, United States.
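This "stuff the context into the prompt" pattern is easy to generalize: build the prompt from whatever context you have on hand. A minimal sketch of that idea (the helper name is ours, not part of any API):

```python
def build_prompt(context: str, question: str) -> str:
    # Prepend known context to the question, mirroring the
    # "Given that ..." pattern used above, and append the same
    # separator used in the earlier completions.
    return "Given that " + context + " " + question + "\n\n###\n\n"

context = ("The 2023 Ohio train derailment occurred on February 3, 2023, "
           "in East Palestine, Ohio, United States.")
question = "Where did the train carrying hazardous materials derail?"
prompt = build_prompt(context, question)
print(prompt.startswith("Given that The 2023 Ohio"))  # True
```

The catch, of course, is that you need to know which context is relevant before you ask, which is exactly the problem the next steps address.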
Manually pasting context into every prompt doesn't scale, but the same idea can be automated: retrieve relevant text and prepend it to the prompt. The LangChain library supports exactly this pattern, so we installed it and loaded the RSS articles as documents.
pip install langchain
from langchain.docstore.document import Document

documents = []
for url in rss_urls:
    xml = get(url)
    parser = Parser(xml=xml.content)
    feed = parser.parse()
    for item in feed.feed:
        documents.append(Document(
            page_content=item.title + '. ' + item.description
        ))

from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
prompt = "Where did the train carrying hazardous materials derail?"
chain = load_qa_chain(OpenAI(temperature=0))
chain({"input_documents":documents, "question":prompt}, return_only_outputs=True)["output_text"]

InvalidRequestError: This model's maximum context length is 4097 tokens, however you requested 17073 tokens (16817 in your prompt; 256 for the completion). Please reduce your prompt; or completion length.
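The failure makes sense once you estimate the prompt size. A common rule of thumb for English text is roughly four characters per token, so a few hundred article titles and summaries easily blow past the 4,097-token window. A back-of-the-envelope sketch (the item and character counts are illustrative, and this heuristic is not a real tokenizer):

```python
def rough_token_count(text: str) -> int:
    # Crude heuristic: ~4 characters per token for English text.
    return len(text) // 4

# Suppose each article contributes ~300 characters of title + summary,
# and the feeds yield ~220 items (both numbers are illustrative).
combined = "x" * (300 * 220)
print(rough_token_count(combined))  # 16500 -- far beyond a 4,097-token window
```

That's the same order of magnitude as the 16,817 prompt tokens in the error above, so the fix has to involve sending less text per request.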
Using text embeddings and vector similarity searches to pre-populate a prompt
To stay within the context window, we used FAISS, an open-source vector similarity search library, to index embeddings of each article and retrieve only the few most relevant ones at query time.
pip install faiss-cpu
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores.faiss import FAISS
search_index = FAISS.from_documents(documents, OpenAIEmbeddings())
prompt = "Where did the train carrying hazardous materials derail?"
chain = load_qa_chain(OpenAI(temperature=0))
chain({"input_documents":search_index.similarity_search(prompt, k=4), "question":prompt}, return_only_outputs=True)["output_text"]
' East Palestine, Ohio.'
It worked! By pairing the similarity search with GPT-3, we can now answer questions about the news in the RSS feeds.
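What makes this work is that the similarity search operates on embedding vectors: documents whose embeddings point in roughly the same direction as the question's embedding are treated as relevant. A toy sketch of that comparison using cosine similarity (the vectors below are made up for illustration, not real embeddings):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings" for illustration only
question_vec = [0.9, 0.1, 0.0]
doc_about_derailment = [0.8, 0.2, 0.1]
doc_about_sports = [0.0, 0.1, 0.9]

print(cosine_similarity(question_vec, doc_about_derailment) >
      cosine_similarity(question_vec, doc_about_sports))  # True
```

FAISS does the same kind of comparison, but over thousands of high-dimensional vectors with an index optimized for fast nearest-neighbor lookup, which is why the `k=4` most similar documents can be fetched cheaply at query time.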
Limitations, possibilities, caveats, and final thoughts
Learn more about Technology at Smart Design
About Carter Parks
Carter Parks is a systems architect who has a knack for applying new technologies to the right problems. He brings expertise in machine learning, full stack web and mobile development, and IoT and has worked with clients in sectors ranging from eCommerce to nutrition, finance, and SaaS. Notable clients include Gatorade. When he isn’t coding, you can find him in the outdoors, probably on a long trail run, or playing the piano.
Resources
Langchain
Question answering
Dagster.io
Build a GitHub support bot with GPT3, LangChain, and Python
GitHub
OpenAI Cookbook