Fine-tuning GPT-2 from human preferences
We’ve fine-tuned the 774M parameter GPT-2 language model using human feedback for various tasks, successfully matching the preferences of the external human labelers, though those preferences did not always match our own. Specifically, for summarization tasks the labelers preferred sentences copied wholesale from the input (we’d only asked them to ensure accuracy), so our models learned to copy. Summarization required 60k human labels; simpler tasks which continue text in various styles required only 5k. Our motivation is to move safety techniques closer to the general task of “machines talking to humans,” which we believe is key to extracting information about human values.

Researchers fine-tuned the 774 million parameter GPT-2 language model on several tasks using human feedback. The goal was to align the model's outputs with the preferences of external human labelers, who provided feedback on the quality and appropriateness of the generated text. The models matched the labelers' preferences, but those preferences did not always match what the researchers themselves wanted.
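To make the idea of training from labeler feedback concrete, here is a minimal sketch of one common way to turn a labeler's choice among candidate outputs into a training signal. It assumes, hypothetically, that a reward model gives each candidate text a scalar score and that the labeler picks one candidate; the loss is then the cross-entropy of a softmax over those scores. The function name `preference_loss` and the score values are illustrative and are not taken from the text above.

```python
import math

def preference_loss(scores, chosen):
    """Cross-entropy of a softmax over candidate scores.

    scores: hypothetical reward-model scores, one per candidate text.
    chosen: index of the candidate the human labeler preferred.
    """
    if not 0 <= chosen < len(scores):
        raise IndexError("chosen must index into scores")
    # Subtract the max score before exponentiating for numerical stability.
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    # Negative log-probability that the softmax assigns to the human's choice.
    return log_norm - scores[chosen]

# Illustrative values only: four candidates scored by an assumed reward model.
loss_agree = preference_loss([0.1, -0.3, 2.0, 0.4], chosen=2)
loss_disagree = preference_loss([0.1, -0.3, 2.0, 0.4], chosen=1)
```

The loss is small when the highest-scoring candidate is the one the labeler chose and larger otherwise, so minimizing it pushes the scores toward the labelers' preferences, whatever those preferences happen to be.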
The clearest example came from the summarization tasks. The researchers had asked the labelers only to ensure that the summaries were accurate, and the labelers responded by preferring sentences copied wholesale from the input text. As a result, the models learned to copy rather than paraphrase. The outcome shows how much the wording of labeler instructions matters: an instruction that leaves out part of what the researchers want can steer training toward behavior they did not intend.
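One rough way to quantify this copying behavior is to measure what fraction of a summary's sentences appear verbatim in the input. The sketch below is an illustration under stated assumptions, not a metric from the original work: the helper names are hypothetical, the sentence splitter is deliberately naive, and the example texts are made up.

```python
import re

def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace (an assumption)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]

def copied_fraction(source, summary):
    """Fraction of summary sentences that appear verbatim in the source text."""
    sentences = split_sentences(summary)
    if not sentences:
        return 0.0
    copied = sum(1 for s in sentences if s in source)
    return copied / len(sentences)

# Made-up example texts for illustration.
source = "The cat sat on the mat. It was warm. Rain fell outside all day."
summary = "The cat sat on the mat. The weather was wet."
fraction = copied_fraction(source, summary)  # one of two sentences is copied
```

A value near 1.0 would indicate a summary built almost entirely from copied sentences, which is the pattern the labelers rewarded.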
The amount of human feedback required depended heavily on the task. Summarization needed 60,000 human labels, whereas simpler tasks that continue text in various styles needed only 5,000, a twelvefold difference. This gap suggests that more complex tasks call for a larger labeling budget and closer management of the feedback process.
The motivation for this research is the belief that moving safety techniques closer to the general task of "machines talking to humans" is key to extracting information about human values. By building human feedback directly into training, the researchers aim for models that not only perform well on specific tasks but also follow human expectations and preferences. That goal matters for any AI system that interacts with people.
These results show that human feedback can steer a large language model toward what labelers prefer. They also show that the labeling process itself needs continual evaluation and refinement, so that models learn the intended behavior rather than a shortcut such as copying. As AI systems become more integrated into daily life, aligning them with human values will be essential for building trust and ensuring that their impact is beneficial. Fine-tuning GPT-2 from human preferences is a step in that direction, and further research is needed on how to incorporate human feedback into training reliably.









