Fine-tuning LLMs on Human Feedback (RLHF + DPO)
Preference tuning with Python code
This article is part of a larger series on Large Language Models (LLMs). In the previous post, I discussed how OpenAI and DeepSeek used reinforcement learning to create their latest advanced reasoning models. Here, I will discuss another way to use reinforcement learning (RL) to fine-tune LLMs on human feedback (i.e., RLHF) and a more efficient reformulation of it (i.e., DPO). I’ll start by reviewing the key ideas behind these approaches and then share a concrete example with Python code.
In 2020, OpenAI released GPT-3, a large language model (LLM) capable of performing arbitrary NLP tasks after seeing just a few examples. This involved writing clever inputs for the model (i.e., prompts) that coaxed it into performing a desired task (e.g., translation, question answering, or cloze tasks) [1].
Despite its groundbreaking performance, GPT-3 was still a long way from the practical LLMs we see today. In other words, additional training was necessary to turn GPT-3 into ChatGPT.
InstructGPT
Two years after the release of the GPT-3 paper, OpenAI published InstructGPT, a fine-tuned version of GPT-3 more aligned with human preferences [2].
This was an essential step because the GPT-3 base model wouldn’t typically generate completions valuable to the user (unless the user was a savvy prompter). Additionally, since it was trained on all sorts of wild data from the internet, its responses could be toxic and offensive.
InstructGPT, on the other hand, was an intuitive chatbot that generated helpful and harmless responses [2]. This was achieved through two rounds of fine-tuning. First, supervised fine-tuning (SFT) on chat data (i.e. input-response pairs). Second, reinforcement learning from human feedback (RLHF). Here, we’ll focus on the latter approach.
Reinforcement Learning from Human Feedback (RLHF)
Reinforcement learning (RL) allows models to learn by trial and error. This consists of assigning rewards to the model's predictions during training and updating its parameters based on those rewards.
RL from human feedback (RLHF) is a way to train models on human preferences at scale [2]. Rather than humans assigning rewards to model outputs in real-time during training, a reward model is trained in a supervised way from human preferences and used as a proxy for them.
Reward Model
To train the reward model, multiple responses were generated from GPT-3 for a single prompt, and then human labelers ranked these responses based on (detailed) guidelines of helpfulness and harmlessness [2].
This is an important detail because it is far easier (and faster) for humans to evaluate the relative quality of responses (i.e., rank them) than to write optimal responses from scratch. The result is reward model training data of higher quality and greater quantity.
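For reference, the reward model in [2] is trained with a pairwise loss of the following form (up to a normalization over the number of pairs drawn from each prompt), where $r_\theta$ is the reward model, $y_w$ the preferred response, and $y_l$ the dispreferred one:

$$
\mathcal{L}(\theta) \;=\; -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim \mathcal{D}}\big[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\big]
$$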
Proximal Policy Optimization (PPO)
The reward model was then used to train InstructGPT via Proximal Policy Optimization (PPO). This is an efficient RL algorithm that updates model parameters based on a reward value [3].
PPO combines two key ideas to create a simpler and more stable optimization than directly pursuing the true objective of reinforcement learning, i.e., finding a policy (model) that maximizes the expected cumulative reward across all possible trajectories through an environment.
First, it uses a surrogate objective, meaning one that approximates the true objective. Second, it incorporates clipping to ensure optimization steps are not too big. The PPO objective is written below.
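In the notation of [3], the clipped surrogate objective is

$$
L^{\text{CLIP}}(\theta) \;=\; \hat{\mathbb{E}}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\;\operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\big)\Big],
\qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},
$$

where $\hat{A}_t$ is an estimate of the advantage at step $t$ and $\epsilon$ is the clipping range (e.g., 0.2).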
Limitations of RLHF
Although InstructGPT outperforms GPT-3 and its SFT checkpoint on human preference metrics after RLHF [2], the approach has a key limitation: the final model will only be as good as the training data used to create the reward model.
This upper bound contrasts with other RL approaches, which can (in principle) improve performance indefinitely with additional training [4]. If we reflect on this point for a moment, it raises a natural question.
If RLHF is fundamentally limited by the training data used to create the reward model, could we just use that preference data to train an LLM directly? While this logic is hand-wavy, the answer is yes.
Direct Preference Optimization (DPO)
Direct Preference Optimization (DPO) reformulates RLHF as a simple text classification problem [5]. Its authors show that the optimal policy for the RLHF objective under a KL-divergence constraint can be derived in closed form.
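Concretely, the KL-constrained reward-maximization problem has an optimal policy of the form

$$
\pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)}\,\pi_{\text{ref}}(y \mid x)\,\exp\!\left(\tfrac{1}{\beta}\, r(x, y)\right),
$$

where $\pi_{\text{ref}}$ is the reference (SFT) policy, $r$ is the reward, $\beta$ controls how far the policy may drift from the reference, and $Z(x)$ is a normalizing constant [5].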
Rearranging this result, the reward can be expressed in terms of the policy itself, so we can skip the reward model entirely and directly optimize the model's log probabilities on pairs of preferred and dispreferred responses. The corresponding loss function is given below [5].
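In the notation of [5], with $y_w$ the preferred response, $y_l$ the dispreferred response, and $\sigma$ the logistic function:

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta;\, \pi_{\text{ref}}) \;=\; -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$

Minimizing this loss pushes up the likelihood of preferred responses relative to the reference model and pushes down the likelihood of dispreferred ones, which is the objective the trl library's DPOTrainer implements.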
The key upside of DPO is we can get the same performance gains as RLHF with a significantly simpler training setup. To demonstrate this, let’s walk through a concrete example.
Example Code: Fine-tuning Qwen2.5–0.5B on Video Title Preferences
LLMs tend to have poor taste when it comes to content creation, which is why it is often obvious when someone uses AI to write a blog post or any other type of content.
Here, I am going to show how we can mitigate this by fine-tuning Qwen2.5–0.5B-Instruct to generate YouTube titles based on my preferences using DPO. The dataset, model, and example code are freely available on the Hugging Face hub and GitHub.
Step 1: Curate preference data
The most important and time-consuming part of this process is generating the preference dataset. My process for doing that was as follows.
- Write a list of 114 video ideas
- Generate 5 titles for each idea using Qwen2.5–7B-Instruct (via Together AI’s API)
- For each idea, create 10 (i.e. 5 choose 2) head-to-head title pairs
- Manually review all 1140 title pairs and label which one is better
It took me about 3 hours to manually label all 1140 pairs. Once I finished, I reformatted the data into three columns: prompt, chosen, and rejected to match the expected format for the trl library’s DPOTrainer.
You can view the full dataset here. The code used for generating the title pairs is freely available here.
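To make the expected format concrete, here is a hypothetical sketch of the pair-construction step (the helper names build_pairs and is_better are illustrative, not the actual code linked above). Each row follows the conversational prompt/chosen/rejected schema that trl's DPOTrainer accepts.

from itertools import combinations

def build_pairs(prompt_text, titles, is_better):
    """
    Turn one video idea's candidate titles into preference rows.

    Args:
        prompt_text (str): the title-writing prompt for one video idea
        titles (list[str]): the 5 candidate titles generated for that idea
        is_better (callable): is_better(a, b) returns True if title `a` was
            manually preferred over title `b`
    """
    rows = []
    for a, b in combinations(titles, 2):  # 5 choose 2 = 10 pairs per idea
        chosen, rejected = (a, b) if is_better(a, b) else (b, a)
        rows.append({
            "prompt": [{"role": "user", "content": prompt_text}],
            "chosen": [{"role": "assistant", "content": chosen}],
            "rejected": [{"role": "assistant", "content": rejected}],
        })
    return rows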
Step 2: Fine-tune using DPO
With the preference data (finally) prepared, we can do the easy part: fine-tuning the model. We start by importing some helpful libraries.
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch
Next, we will import the training dataset (from Step 1) and the base model (Qwen2.5–0.5B-Instruct).
# load dataset
dataset = load_dataset("shawhin/youtube-titles-dpo")
# load model and tokenizer
model_name = "Qwen/Qwen2.5-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token # set pad token
Before fine-tuning this model, it’s helpful to get a sense of its title writing ability. So, let’s generate a title with it using a prompt from the validation dataset.
# helper function
def format_chat_prompt(user_input):
    """
    Formats user input into the chat template format with <|im_start|> and
    <|im_end|> tags.

    Args:
        user_input (str): The input text from the user.

    Returns:
        str: Formatted prompt for the model.
    """
    # Format user message
    user_prompt = f"<|im_start|>user\n{user_input}<|im_end|>\n"

    # Start assistant's turn
    assistant_prompt = "<|im_start|>assistant\n"

    # Combine prompts
    formatted_prompt = user_prompt + assistant_prompt

    return formatted_prompt
# Set up text generation pipeline
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
# Example prompt
prompt = format_chat_prompt(dataset['valid']['prompt'][0][0]['content'])
# Generate output
outputs = generator(prompt, max_length=100, truncation=True,
                    num_return_sequences=1, temperature=0.7)
print(outputs[0]['generated_text'])
<|im_start|>user
Given the YouTube video idea write an engaging title.
**Video Idea**: intro independent component analysis
**Additional Guidance**:
- Title should be between 30 and 75 characters long
- Only return the title idea, nothing else!<|im_end|>
<|im_start|>assistant
"Unlocking Independent Component Analysis: The Key to Understanding Your Data!"
We can see that this title isn't very good: it is too long and includes vague phrasing like "The Key to Understanding Your Data!". This is in contrast to the title I actually used for a video on ICA: "Independent Component Analysis (ICA) | EEG Analysis Example Code".
To make the model completions more aligned with my preferences, let’s fine-tune it. First, we need to define the training arguments for DPO. Here, I use a batch size of 8 and train for 3 epochs. A checkpoint is saved at each epoch, and the best one is loaded at the end.
# define training args
ft_model_name = model_name.split('/')[1].replace("Instruct", "DPO")

training_args = DPOConfig(
    output_dir=ft_model_name,
    logging_steps=25,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    save_strategy="epoch",
    eval_strategy="epoch",
    eval_steps=1,
)
Next, we can train the model using the DPOTrainer.
# train model
trainer = DPOTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=dataset['train'],
    eval_dataset=dataset['valid'],
)
trainer.train()
The model begins to overfit after epoch 2, so we will use that checkpoint as the final model. Here’s what the fine-tuned model comes up with for the same video idea we saw earlier.
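To generate that completion, note that load_best_model_at_end=True means the trainer reloaded the lowest-eval-loss checkpoint into trainer.model once training finished, so one way to run the fine-tuned model is roughly the following sketch (the pipeline settings are carried over from the earlier example).

# save the best checkpoint (reloaded by the trainer at the end of training)
trainer.save_model(ft_model_name)

# rebuild the text generation pipeline around the fine-tuned weights
ft_generator = pipeline("text-generation", model=trainer.model, tokenizer=tokenizer)

# regenerate a title for the same validation prompt as before
outputs = ft_generator(prompt, max_length=100, truncation=True,
                       num_return_sequences=1, temperature=0.7)
print(outputs[0]['generated_text'])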
<|im_start|>user
Given the YouTube video idea write an engaging title.
**Video Idea**: intro independent component analysis
**Additional Guidance**:
- Title should be between 30 and 75 characters long
- Only return the title idea, nothing else!<|im_end|>
<|im_start|>assistant
Independent Component Analysis for Beginners
Although it's not the same as the title I actually used, it's still a significant improvement over the base model's generation.
Step 3: Evaluate fine-tuned model
While looking at individual examples gives us a "vibe check" on the new model's performance, it is not a robust evaluation strategy. A better approach is to systematically compare title generations from the base and fine-tuned models.
Unfortunately, qualities like taste and preference are difficult to capture in standard evals (which is why we used DPO in the first place). Additionally, I couldn't effectively automate the evaluation with a judge LLM (despite several attempts with GPT-4o).
That’s why I, again, did some manual data labeling. My process was as follows.
- Pick out 50 random video title ideas from my initial list
- For each idea, generate a title using the base and fine-tuned models, respectively
- Manually review all 50 pairs and assign preference labels
- Compute how often fine-tuned model titles were preferred
Since I only considered 50 pairs, this took significantly less time (about 10 minutes). The final result: the fine-tuned model's titles were preferred 68% of the time.
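A minimal sketch of this blind head-to-head comparison might look like the following. It assumes base_generator and ft_generator are pipelines wrapping a freshly loaded base model and the fine-tuned model, and that eval_prompts holds the 50 formatted prompts (built with format_chat_prompt as in Step 2); these names are illustrative, not the actual evaluation code.

import random

def extract_title(generator, prompt):
    # generate a completion and keep only the assistant's portion
    out = generator(prompt, max_length=100, truncation=True, temperature=0.7)
    return out[0]["generated_text"].split("<|im_start|>assistant\n")[-1].strip()

ft_wins = 0
for prompt in eval_prompts:
    pair = [("base", extract_title(base_generator, prompt)),
            ("ft", extract_title(ft_generator, prompt))]
    random.shuffle(pair)  # hide which model wrote which title
    print(f"\nA: {pair[0][1]}\nB: {pair[1][1]}")
    choice = input("Preferred title (A/B): ").strip().upper()
    winner = pair[0] if choice == "A" else pair[1]
    if winner[0] == "ft":
        ft_wins += 1

print(f"Fine-tuned titles preferred {ft_wins / len(eval_prompts):.0%} of the time")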
Limitations
Although the fine-tuned model's title generations show a clear improvement, there are still a handful of opportunities to do better.
- The preference data were generated with the 7B version of Qwen2.5, while the 0.5B version was fine-tuned. Using the same model for both steps would simplify the learning task.
- In several title pairs from the preference dataset, both titles were bad. Future work should experiment with removing these pairs and observing how performance changes.
- Some generated titles were in Chinese, likely due to the language's prominence in Qwen's training data. Next time, I'll experiment with other models like Llama or Gemma.
- Here, I used a 0.5B-parameter model so it could run quickly on my laptop; a bigger model would likely perform better on this task.
- DPO was applied directly to an instruction-tuned model. Future work should experiment with first doing a round of SFT on real title examples and then applying preference tuning with DPO.
Conclusion
LLMs can perform a wide range of tasks out of the box. However, aligning their responses with human preferences can be difficult to achieve through prompt engineering or supervised fine-tuning alone.
Here, we discussed how RLHF and DPO address this by fine-tuning an LLM on relative human preferences (i.e., ranked model completions). We then used DPO to align YouTube titles generated by Qwen2.5-0.5B-Instruct with my personal preferences.
If you have any questions or suggestions, please let me know in the comments :)
References
[1] T. Brown et al., "Language Models are Few-Shot Learners," arXiv:2005.14165 [cs.CL], 2020.
[2] L. Ouyang et al., "Training language models to follow instructions with human feedback," arXiv:2203.02155 [cs.CL], 2022.
[3] J. Schulman et al., "Proximal Policy Optimization Algorithms," arXiv:1707.06347 [cs.LG], 2017.
[4] A. Karpathy, "Deep Dive into LLMs like ChatGPT" (video).
[5] R. Rafailov et al., "Direct Preference Optimization: Your Language Model is Secretly a Reward Model," arXiv:2305.18290 [cs.LG], 2023.