Fine-tuning LLMs on Human Feedback (RLHF + DPO)
Preference tuning with Python code
This article is part of a larger series on Large Language Models (LLMs). In the previous post, I discussed how OpenAI and DeepSeek used reinforcement learning to create their latest advanced reasoning models. Here, I will discuss another way to use reinforcement learning (RL) to fine-tune LLMs: reinforcement learning from human feedback (RLHF), along with a more efficient reformulation of it, Direct Preference Optimization (DPO). I’ll start by reviewing the key ideas behind these approaches and then share a concrete example with Python code.
In 2020, OpenAI released GPT-3, a large language model (LLM) capable of performing arbitrary NLP tasks after seeing just a few examples. This consisted of writing clever inputs for the model (i.e., prompts) that coaxed it into performing a desired task (e.g., translation, question answering, or cloze tasks) [1].
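To make this concrete, here is a minimal sketch of what such a few-shot prompt can look like. The task and example sentences are made up for illustration and are not taken from the GPT-3 paper; the point is only that the prompt ends mid-pattern, so the model completes it by performing the task.

```python
# A minimal sketch of a few-shot prompt (illustrative only; the task and
# example sentences are made up, not taken from the GPT-3 paper).

few_shot_prompt = """Translate English to French.

English: The cat sat on the mat.
French: Le chat était assis sur le tapis.

English: I love programming.
French: J'adore programmer.

English: Where is the library?
French:"""

# The prompt ends mid-pattern, so a capable LLM continues it by producing
# the French translation of the final English sentence.
print(few_shot_prompt)
```

This "complete the pattern" behavior is what the GPT-3 paper calls in-context (few-shot) learning: no weights are updated, only the prompt changes.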
Despite its groundbreaking performance, GPT-3 was still a long way from the practical LLMs we see today. In other words, additional training was necessary to turn GPT-3 into ChatGPT.
InstructGPT
Two years after the release of the GPT-3 paper, OpenAI published InstructGPT, a fine-tuned version of GPT-3 more aligned with human preferences [2].