Simeon Spencer

How Does ChatGPT Work?


To understand how ChatGPT works, you first need to know the underlying model and how it was repeatedly tweaked to eventually become the ChatGPT you know today. The foundation for ChatGPT began with a Large Language Model (LLM), a class of Natural Language Processing (NLP) models.

Large Language Models

The most basic training of language models involves predicting a word in a sequence of words. One way to do this is with a Long Short-Term Memory (LSTM) model, which fills in the blank in a sentence with the most statistically probable word given the surrounding context. However, the LSTM model is limited in how it weighs surrounding words against one another, and it cannot process the input data as a whole, so it captures only relatively simple relationships between words and meanings. To address this, Google Brain introduced the transformer, an architecture that processes all of the input data simultaneously and assigns varying weight to different parts of it, capturing far richer relationships between words than LSTMs could. This has led to significant improvements in processing larger datasets.

Generative Pre-trained Transformer (GPT) and Self-Attention

GPT models were launched in 2018 by OpenAI as GPT-1 and have continued to evolve with GPT-2, GPT-3, InstructGPT, and ChatGPT. GPT-3 represented the most significant advancement in the GPT model's evolution thanks to improvements in computational efficiency, which allowed training on larger datasets and expanded its knowledge base and abilities. All GPT models use the transformer architecture, which includes an encoder to process input and a decoder to generate output. Both rely on a self-attention mechanism that differentially weighs parts of the sequence to determine context and meaning: it converts tokens (pieces of text such as words, sentences, or groups of text) into vectors and assigns each one an importance within the sequence. GPT uses a ‘multi-head’ attention mechanism, an extension of self-attention that performs this process several times in parallel, each time with its own linear projections of the query, key, and value vectors. This extension allows the model to capture more complex relationships and sub-meanings within the input data. Masking is also applied during training so that each token is predicted only from the tokens that come before it, which helps the model learn the relationships between words and produce more coherent responses.
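To make the self-attention mechanism described above concrete, here is a minimal, illustrative sketch in Python (using NumPy) of scaled dot-product self-attention and a ‘multi-head’ extension. The dimensions, random weights, and function names are invented for illustration; this is not GPT's actual implementation.

```python
# Minimal sketch of scaled dot-product self-attention with multiple heads.
# Illustrative only: toy dimensions and random weights, not GPT's real code.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model) token vectors; w_*: learned projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v       # query, key, and value vectors
    scores = q @ k.T / np.sqrt(k.shape[-1])   # how much each token attends to every other token
    weights = softmax(scores, axis=-1)        # attention weights sum to 1 for each token
    return weights @ v                        # each output is a weighted mix of value vectors

def multi_head_attention(x, heads):
    """'Multi-head': run self-attention several times with different projections."""
    outputs = [self_attention(x, w_q, w_k, w_v) for (w_q, w_k, w_v) in heads]
    return np.concatenate(outputs, axis=-1)   # concatenated head outputs feed the next layer

rng = np.random.default_rng(0)
d_model, d_head, seq_len, n_heads = 8, 4, 5, 2
x = rng.normal(size=(seq_len, d_model))       # embeddings for 5 tokens
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3)) for _ in range(n_heads)]
print(multi_head_attention(x, heads).shape)   # (5, 8)
```

Because each head has its own query, key, and value projections, different heads can attend to different kinds of relationships in the same sequence, which is what lets the model pick up sub-meanings in the input.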
Despite these advancements, GPT-3 was prone to hallucinations and produced unhelpful or biased outputs. Innovative training methodologies were introduced in ChatGPT to address these issues.

ChatGPT

ChatGPT is a spinoff of InstructGPT, which introduced a novel approach to incorporating human feedback into the training process to better align model outputs with user intent. Reinforcement Learning from Human Feedback (RLHF) is described in depth in OpenAI’s 2022 paper Training language models to follow instructions with human feedback and is simplified below.

Step 1: Supervised Fine Tuning (SFT) Model

OpenAI developed the SFT model, also referred to as GPT-3.5, by fine-tuning GPT-3 on a supervised training dataset created by 40 contractors. The dataset included prompts collected from actual user entries into the OpenAI API, with labelers writing an appropriate response for each prompt. To maximize diversity in the prompt dataset, prompts from any single user ID were limited to 200, and any prompts containing personally identifiable information were removed. Labelers also created sample prompts for categories with minimal real data, including plain prompts, few-shot prompts, and user-based prompts. The resulting dataset had 13,000 input/output samples.
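As a rough illustration of what this supervised fine-tuning step looks like, the sketch below fine-tunes a small pretrained causal language model on a couple of made-up prompt/response pairs using the Hugging Face transformers library. GPT-2 stands in for GPT-3 here, and the data, learning rate, and single-example training loop are simplifying assumptions rather than OpenAI's actual setup.

```python
# Schematic of supervised fine-tuning (SFT): a pretrained causal language model is
# trained on (prompt, labeler-written response) demonstrations.
# GPT-2 is used as a stand-in for GPT-3; the two pairs below are invented examples.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

sft_pairs = [  # in practice: ~13,000 prompt/response demonstrations
    ("Explain gravity to a child.", "Gravity is the force that pulls things toward the ground."),
    ("Translate 'bonjour' into English.", "'Bonjour' means 'hello'."),
]

model.train()
for prompt, response in sft_pairs:
    # The demonstration text serves as both input and target: the model learns to
    # continue the prompt with the labeler-written response.
    batch = tokenizer(prompt + "\n" + response, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"loss: {loss.item():.3f}")
```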

Step 2: Reward Model

Step 2 involves training a reward model after the SFT model from Step 1 has been trained. The reward model takes a prompt and a response as input and produces a scalar value called a reward. This model is needed for the reinforcement learning in Step 3, which teaches the model to produce outputs that maximize this reward. To train the reward model, labelers rank between 4 and 9 SFT model outputs for a single prompt, which yields every pairwise combination of better and worse responses for that prompt. To avoid overfitting, all of the comparisons from a single prompt are treated as one batch element rather than as independent datapoints.
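The toy sketch below illustrates this reward-model objective: a small scorer assigns a scalar reward to each of K ranked responses, every pair of responses becomes a comparison, and all comparisons from one prompt are processed together as a single batch. The tiny network and random "response embeddings" are placeholders for illustration, not the real GPT-based reward model.

```python
# Toy illustration of the reward-model training objective.
# A stand-in scorer replaces the real GPT-based reward model.
import itertools
import torch
import torch.nn as nn

torch.manual_seed(0)
reward_model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

K = 4                                  # labelers rank 4 to 9 responses per prompt
responses = torch.randn(K, 16)         # pretend embeddings of the K responses,
                                       # already sorted: index 0 is ranked best
pairs = list(itertools.combinations(range(K), 2))   # K-choose-2 comparisons

rewards = reward_model(responses).squeeze(-1)       # one scalar reward per response
better = rewards[[i for i, j in pairs]]
worse = rewards[[j for i, j in pairs]]
# Pairwise ranking loss: push the preferred response's reward above the other's.
# All comparisons from this one prompt form a single batch element.
loss = -torch.nn.functional.logsigmoid(better - worse).mean()
loss.backward()
optimizer.step()
print(f"ranking loss: {loss.item():.3f}")
```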

Step 3: Reinforcement Learning Model

In Step 3, the model generates a response to a random prompt using a policy initialized from the SFT model, with the goal of maximizing its reward. The reward model developed in Step 2 assigns a reward value to the resulting prompt and response pair, and this reward is then used to update the model's policy. The Proximal Policy Optimization (PPO) methodology, introduced in 2017 by Schulman et al., is used to update the policy as responses are generated. PPO incorporates a per-token Kullback–Leibler (KL) penalty against the SFT model to avoid over-optimizing the reward model and deviating too drastically from the human-intention dataset. Steps 2 and 3 can be iterated repeatedly, although this is not commonly done in practice.
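To show how the KL penalty shapes the reward signal that PPO optimizes, here is a conceptual sketch with made-up numbers. A real implementation would compute the per-token log-probabilities from the actual PPO policy and SFT model; the point is only that responses which drift far from the SFT model get their reward reduced.

```python
# Conceptual sketch of the Step 3 reward signal: the reward model's score for a
# (prompt, response) pair minus a per-token KL penalty that keeps the PPO policy
# close to the SFT model. All numbers below are made up for illustration.
import torch

beta = 0.02                                  # strength of the KL penalty
reward_model_score = torch.tensor(1.7)       # scalar reward for the full response

# Per-token log-probabilities of the sampled response under each model.
logp_policy = torch.tensor([-1.2, -0.8, -2.1, -0.5])   # PPO policy
logp_sft = torch.tensor([-1.0, -1.1, -1.9, -0.9])      # frozen SFT model

per_token_kl = logp_policy - logp_sft        # per-token estimate of KL(policy || SFT)
total_reward = reward_model_score - beta * per_token_kl.sum()
print(f"KL-penalized reward: {total_reward.item():.3f}")
# PPO updates the policy to favor responses with high KL-penalized reward,
# so it cannot chase the reward model by drifting too far from the SFT model.
```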

All of these models together have resulted in the GPT-4-powered ChatGPT you know today. We expect ChatGPT to continue to advance at a rapid pace, as public access provides an abundance of rich data and feedback that the OpenAI team can leverage to continue building upon the above models and eventually make ChatGPT able to learn autonomously.
