
Computer Science > Computation and Language

Title: Learning to Summarize from Human Feedback

Abstract: As language models become more powerful, training and evaluation are increasingly bottlenecked by the data and metrics used for a particular task. For example, summarization models are often trained to predict human reference summaries and evaluated using ROUGE, but both of these metrics are rough proxies for what we really care about -- summary quality. In this work, we show that it is possible to significantly improve summary quality by training a model to optimize for human preferences. We collect a large, high-quality dataset of human comparisons between summaries, train a model to predict the human-preferred summary, and use that model as a reward function to fine-tune a summarization policy using reinforcement learning. We apply our method to a version of the TL;DR dataset of Reddit posts and find that our models significantly outperform both human reference summaries and much larger models fine-tuned with supervised learning alone. Our models also transfer to CNN/DM news articles, producing summaries nearly as good as the human reference without any news-specific fine-tuning. We conduct extensive analyses to understand our human feedback dataset and fine-tuned models. We establish that our reward model generalizes to new datasets, and that optimizing our reward model results in better summaries than optimizing ROUGE according to humans. We hope the evidence from our paper motivates machine learning researchers to pay closer attention to how their training loss affects the model behavior they actually want.



Learning to summarize with human feedback

Part of Advances in Neural Information Processing Systems 33 (NeurIPS 2020)

Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul F. Christiano



Recursively Summarizing Books with Human Feedback


A major challenge for scaling machine learning is training models to perform tasks that are very difficult or time-consuming for humans to evaluate. We present progress on this problem on the task of abstractive summarization of entire fiction novels. Our method combines learning from human feedback with recursive task decomposition: we use models trained on smaller parts of the task to assist humans in giving feedback on the broader task. We collect a large volume of demonstrations and comparisons from human labelers, and fine-tune GPT-3 using behavioral cloning and reward modeling to do summarization recursively. At inference time, the model first summarizes small sections of the book and then recursively summarizes these summaries to produce a summary of the entire book. Our human labelers are able to supervise and evaluate the models quickly, despite not having read the entire books themselves. Our resulting model generates sensible summaries of entire books, even matching the quality of human-written summaries in a few cases (∼5% of books). We achieve state-of-the-art results on the recent BookSum dataset for book-length summarization. A zero-shot question-answering model using these summaries achieves state-of-the-art results on the challenging NarrativeQA benchmark for answering questions about books and movie scripts. We release datasets of samples from our model.
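The recursive decomposition described above can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation: `summarize_chunk` stands in for the fine-tuned GPT-3 model, and here it simply keeps each chunk's first sentence so the control flow is runnable.

```python
def summarize_chunk(text: str) -> str:
    """Stand-in for a learned summarizer; keeps only the first sentence."""
    return text.split(". ")[0].rstrip(".") + "."

def recursive_summarize(sections: list[str], fan_in: int = 3) -> str:
    """Summarize each leaf section, then recursively summarize groups of
    summaries until a single top-level summary remains."""
    summaries = [summarize_chunk(s) for s in sections]
    while len(summaries) > 1:
        grouped = [" ".join(summaries[i:i + fan_in])
                   for i in range(0, len(summaries), fan_in)]
        summaries = [summarize_chunk(g) for g in grouped]
    return summaries[0]
```

The key property the paper relies on is that each call to the summarizer only sees an input of bounded size, so human labelers can likewise evaluate each step without reading the whole book.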

Long Ouyang, Daniel M. Ziegler, Nisan Stiennon, Paul Christiano

Related Research

  • Self-Critiquing Models for Assisting Human Evaluators
  • Human-in-the-Loop Abstractive Dialogue Summarization
  • Learning to Summarize from Human Feedback
  • Benchmarking Large Language Models for News Summarization
  • WebGPT: Browser-Assisted Question-Answering with Human Feedback
  • QuALITY: Question Answering with Long Input Texts, Yes!
  • Time-Efficient Reward Learning via Visually Assisted Cluster Ranking


Gain book insights with AI and humans.

OpenAI’s Summarizing Books with Human Feedback: Accurate Insights. Comprehensive Summaries.

Get insights and key points from long books with our AI-powered summarization and human-reviewed summaries. Comprehensive and easy to understand.

Summarizing Books with Human Feedback

Introducing Summarizing Books with Human Feedback: gain insights and key points quickly.

OpenAI’s Summarizing Books with Human Feedback service condenses lengthy books into concise summaries that capture their primary points and main ideas. What sets it apart is the incorporation of human feedback: AI-powered summarization is combined with human review, producing summaries that are comprehensive yet easily digestible. Instead of spending hours reading every page, you can quickly access a book's crucial insights and key points.

OpenAI's Summarizing Books with Human Feedback service is aimed at professionals and individuals who are pressed for time but still need to extract valuable insights from books: busy executives, entrepreneurs, students, and researchers. While AI can efficiently generate summaries, human feedback ensures their accuracy, clarity, and context. Whether you need to prepare for a meeting, conduct market research, write a report, or simply expand your knowledge base, you can grasp a book's main ideas without investing excessive time and effort.

Main Features

Save time by avoiding lengthy reading.

Benefits of using Summarizing Books with Human Feedback

By combining AI-powered summarization with human review, the service produces accurate and concise summaries that capture a book's essential concepts and present them in an accessible manner. This streamlined approach lets busy readers navigate extensive texts efficiently, saving time without compromising understanding.

Full Review

The tool uses AI algorithms to analyze and distill the content of books into coherent summaries, then applies a human review process to ensure accuracy and clarity. The result is summaries that capture the main ideas and concepts of a book while remaining easy to understand. Whether you are a student looking to grasp the key concepts of a textbook or a professional seeking insights from a business guide, the service delivers concise and informative summaries that save valuable time.


constantinbosse.com

Good read: AI – Summarizing Books with Human Feedback

“Summarizing Books with Human Feedback” was published on OpenAI.com. It is a quick read, very clear and well written. They share some examples of how they had AI summarize books. In the article they give the example of the classic “Alice’s Adventures in Wonderland” by Lewis Carroll, which they used to test it.

They briefly explain their recursive approach and how they include human feedback during the process of evaluating the AI’s summaries.


They also include a link to their official research paper written by Jeff Wu, Long Ouyang, Daniel M. Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, and Paul Christiano.

In my opinion, what has been achieved so far is impressive. A brief look at the summaries shows there is still plenty to improve, but having AI summarize entire books is remarkable nonetheless.

It is remarkable work that I think can find great application, especially for technical texts, though for fiction it may be less applicable. I wish I had had this tool to double-check my summaries when I was studying.

I believe the summaries lack some human touch, but that is to be expected. Admittedly, this is a very subjective impression on my part.


Illustrating Reinforcement Learning from Human Feedback (RLHF)

Nathan Lambert

Language models have shown impressive capabilities in the past few years by generating diverse and compelling text from human input prompts. However, what makes a "good" text is inherently hard to define as it is subjective and context dependent. There are many applications such as writing stories where you want creativity, pieces of informative text which should be truthful, or code snippets that we want to be executable.

Writing a loss function to capture these attributes seems intractable, and most language models are still trained with a simple next-token prediction loss (e.g. cross-entropy). To compensate for the shortcomings of the loss itself, people define metrics designed to better capture human preferences, such as BLEU or ROUGE. While better suited than the loss function at measuring performance, these metrics simply compare generated text to references with simple rules and are thus also limited. Wouldn't it be great if we used human feedback on generated text as a measure of performance, or went one step further and used that feedback as a loss to optimize the model? That's the idea of Reinforcement Learning from Human Feedback (RLHF): use methods from reinforcement learning to directly optimize a language model with human feedback. RLHF has enabled language models to begin to align a model trained on a general corpus of text data with complex human values.
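To make concrete how shallow these rule-based metrics are, here is a minimal ROUGE-1 recall computation (a simplified sketch, not the official ROUGE implementation): it counts only unigram overlap with a reference, so it is blind to grammar, meaning, and factuality.

```python
from collections import Counter

def rouge1_recall(candidate: str, reference: str) -> float:
    """Fraction of reference unigrams (with multiplicity) that also
    appear in the candidate. Pure word overlap, nothing more."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], count) for w, count in ref.items())
    return overlap / max(sum(ref.values()), 1)
```

Note that a candidate containing the same words in scrambled order scores a perfect 1.0, which is exactly the kind of blindness described above.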

RLHF's most recent success was its use in ChatGPT . Given ChatGPT's impressive abilities, we asked it to explain RLHF for us:


It does surprisingly well, but doesn't quite cover everything. We'll fill in those gaps!

RLHF: Let’s take it step by step

Reinforcement learning from human feedback (also referenced as RL from human preferences) is a challenging concept because it involves a multi-model training process and different stages of deployment. In this blog post, we’ll break down the training process into three core steps:

  • Pretraining a language model (LM),
  • gathering data and training a reward model, and
  • fine-tuning the LM with reinforcement learning.

To start, we'll look at how language models are pretrained.

Pretraining language models

As a starting point, RLHF uses a language model that has already been pretrained with the classical pretraining objectives (see this blog post for more details). OpenAI used a smaller version of GPT-3 for its first popular RLHF model, InstructGPT. In their shared papers, Anthropic used transformer models from 10 million to 52 billion parameters trained for this task. DeepMind has documented using up to their 280 billion parameter model Gopher. It is likely that all these companies use much larger models in their RLHF-powered products.

This initial model can also be fine-tuned on additional text or conditions, but does not necessarily need to be. For example, OpenAI fine-tuned on human-generated text that was “preferable” and Anthropic generated their initial LM for RLHF by distilling an original LM on context clues for their “helpful, honest, and harmless” criteria. These are both sources of what we refer to as expensive, augmented data, but it is not a required technique to understand RLHF. Core to starting the RLHF process is having a model that responds well to diverse instructions .

In general, there is not a clear answer on “which model” is the best starting point for RLHF. This will be a common theme in this blog – the design space of options in RLHF training is not thoroughly explored.

Next, with a language model, one needs to generate data to train a reward model , which is how human preferences are integrated into the system.


Reward model training

Generating a reward model (RM, also referred to as a preference model) calibrated with human preferences is where the relatively new research in RLHF begins. The underlying goal is to get a model or system that takes in a sequence of text, and returns a scalar reward which should numerically represent the human preference. The system can be an end-to-end LM, or a modular system outputting a reward (e.g. a model ranks outputs, and the ranking is converted to reward). The output being a scalar reward is crucial for existing RL algorithms being integrated seamlessly later in the RLHF process.

These LMs for reward modeling can be either another fine-tuned LM or an LM trained from scratch on the preference data. For example, Anthropic has used a specialized method of fine-tuning to initialize these models after pretraining (preference model pretraining, PMP) because they found it to be more sample efficient than plain fine-tuning, but no one base model is considered the clear best choice for reward models.

The training dataset of prompt-generation pairs for the RM is generated by sampling a set of prompts from a predefined dataset (Anthropic’s data generated primarily with a chat tool on Amazon Mechanical Turk is available on the Hub, and OpenAI used prompts submitted by users to the GPT API). The prompts are passed through the initial language model to generate new text.

Human annotators are used to rank the generated text outputs from the LM. One may initially think that humans should apply a scalar score directly to each piece of text in order to generate a reward model, but this is difficult to do in practice. The differing values of humans cause these scores to be uncalibrated and noisy. Instead, rankings are used to compare the outputs of multiple models and create a much better regularized dataset.
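A common way to turn such pairwise rankings into a training signal is a Bradley–Terry-style loss on the difference of reward scores. A minimal sketch of that loss for a single comparison (the real reward model would produce `r_chosen` and `r_rejected` by scoring two candidate texts):

```python
import math

def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-sigmoid of the reward margin: minimizing this pushes
    the reward model to score the human-preferred text higher than the
    rejected one."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At zero margin the loss equals log 2, and it shrinks toward zero as the model learns to separate preferred from rejected outputs.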

There are multiple methods for ranking the text. One method that has been successful is to have users compare generated text from two language models conditioned on the same prompt. By comparing model outputs in head-to-head matchups, an Elo system can be used to generate a ranking of the models and outputs relative to each other. These different methods of ranking are normalized into a scalar reward signal for training.
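The Elo system mentioned above uses the standard chess-style update. A minimal sketch of one head-to-head update (the K-factor of 32 is an illustrative choice, not taken from any of the cited papers):

```python
def elo_update(rating_a: float, rating_b: float, a_wins: bool,
               k: float = 32.0) -> tuple[float, float]:
    """One head-to-head Elo update: the winner gains rating in
    proportion to how unexpected the win was."""
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta
```

Running many such updates over annotator comparisons yields a relative ranking of models and outputs, which can then be normalized into the scalar reward signal.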

An interesting artifact of this process is that the successful RLHF systems to date have used reward language models with varying sizes relative to the text generation (e.g. OpenAI 175B LM, 6B reward model, Anthropic used LM and reward models from 10B to 52B, DeepMind uses 70B Chinchilla models for both LM and reward). An intuition would be that these preference models need to have similar capacity to understand the text given to them as a model would need in order to generate said text.

At this point in the RLHF system, we have an initial language model that can be used to generate text and a preference model that takes in any text and assigns it a score of how well humans perceive it. Next, we use reinforcement learning (RL) to optimize the original language model with respect to the reward model.


Fine-tuning with RL

Training a language model with reinforcement learning was, for a long time, something people would have thought impossible for both engineering and algorithmic reasons. What multiple organizations seem to have gotten to work is fine-tuning some or all of the parameters of a copy of the initial LM with a policy-gradient RL algorithm, Proximal Policy Optimization (PPO). Some parameters of the LM are frozen because fine-tuning an entire 10B or 100B+ parameter model is prohibitively expensive (for more, see Low-Rank Adaptation (LoRA) for LMs or the Sparrow LM from DeepMind), depending on the scale of the model and infrastructure being used. Exactly how many parameters to freeze, or not, is considered an open research problem. PPO has been around for a relatively long time, and there are tons of guides on how it works. The relative maturity of this method made it a favorable choice for scaling up to the new application of distributed training for RLHF. It turns out that many of the core RL advancements in RLHF have been about figuring out how to update such a large model with a familiar algorithm (more on that later).

Let's first formulate this fine-tuning task as a RL problem. First, the policy is a language model that takes in a prompt and returns a sequence of text (or just probability distributions over text). The action space of this policy is all the tokens corresponding to the vocabulary of the language model (often on the order of 50k tokens) and the observation space is the distribution of possible input token sequences, which is also quite large given previous uses of RL (the dimension is approximately the size of vocabulary ^ length of the input token sequence). The reward function is a combination of the preference model and a constraint on policy shift.

The reward function is where the system combines all of the models we have discussed into one RLHF process. Given a prompt, x, from the dataset, the text y is generated by the current iteration of the fine-tuned policy. Concatenated with the original prompt, that text is passed to the preference model, which returns a scalar notion of “preferability”, $r_\theta$. In addition, per-token probability distributions from the RL policy are compared to the ones from the initial model to compute a penalty on the difference between them. In multiple papers from OpenAI, Anthropic, and DeepMind, this penalty has been designed as a scaled version of the Kullback–Leibler (KL) divergence between these sequences of distributions over tokens, $r_\text{KL}$. The KL divergence term penalizes the RL policy from moving substantially away from the initial pretrained model with each training batch, which can be useful to make sure the model outputs reasonably coherent text snippets. Without this penalty the optimization can start to generate text that is gibberish but fools the reward model to give a high reward. In practice, the KL divergence is approximated via sampling from both distributions (explained by John Schulman here). The final reward sent to the RL update rule is $r = r_\theta - \lambda r_\text{KL}$.
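The final reward can be sketched compactly, assuming we already have the preference model's scalar score and per-token log-probabilities from both the RL policy and the frozen initial model. The mean per-token log-ratio used here is one simple sampling-based KL estimate; the papers differ in the exact estimator and in how the coefficient is tuned.

```python
def rlhf_reward(rm_score: float,
                policy_logprobs: list[float],
                ref_logprobs: list[float],
                kl_coef: float = 0.1) -> float:
    """r = r_theta - lambda * r_KL, with the KL term estimated from the
    sampled sequence as the mean per-token log-ratio between the RL
    policy and the frozen initial model."""
    kl_est = sum(p - q for p, q in zip(policy_logprobs, ref_logprobs))
    kl_est /= len(policy_logprobs)
    return rm_score - kl_coef * kl_est
```

When the policy has not moved from the initial model, the log-ratios are zero and the reward is just the preference model's score; as the policy drifts, the penalty grows.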

Some RLHF systems have added additional terms to the reward function. For example, OpenAI experimented successfully on InstructGPT by mixing in additional pre-training gradients (from the human annotation set) into the update rule for PPO. It is likely as RLHF is further investigated, the formulation of this reward function will continue to evolve.

Finally, the update rule is the parameter update from PPO that maximizes the reward metrics in the current batch of data (PPO is on-policy, which means the parameters are only updated with the current batch of prompt-generation pairs). PPO is a trust region optimization algorithm that uses constraints on the gradient to ensure the update step does not destabilize the learning process. DeepMind used a similar reward setup for Gopher but used synchronous advantage actor-critic (A2C) to optimize the gradients, which is notably different but has not been reproduced externally.
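The clipping PPO uses to keep a single batch from destabilizing learning can be written down in a few lines. A minimal per-sample sketch of the clipped surrogate objective (PPO maximizes the average of this over a batch; here the log-probabilities and advantage are illustrative inputs):

```python
import math

def ppo_clip_objective(logp_new: float, logp_old: float,
                       advantage: float, eps: float = 0.2) -> float:
    """Per-sample PPO clipped surrogate: the probability ratio between
    the new and old policy is clipped to [1 - eps, 1 + eps], so the
    update cannot profit from moving the policy too far in one step."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)
```

Taking the minimum of the clipped and unclipped terms makes the objective pessimistic: large policy shifts never increase it, which is the trust-region-like constraint described above.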


Technical detail note: The above diagram makes it look like both models generate different responses for the same prompt, but what really happens is that the RL policy generates text, and that text is fed into the initial model to produce its relative probabilities for the KL penalty. This initial model is untouched by gradient updates during training.

Optionally, RLHF can continue from this point by iteratively updating the reward model and the policy together. As the RL policy updates, users can continue ranking these outputs versus the model's earlier versions. Most papers have yet to discuss implementing this operation, as the deployment mode needed to collect this type of data only works for dialogue agents with access to an engaged user base. Anthropic discusses this option as Iterated Online RLHF (see the original paper), where iterations of the policy are included in the Elo ranking system across models. This introduces complex dynamics of the policy and reward model evolving, which represents a complex and open research question.

Open-source tools for RLHF

The first code released to perform RLHF on LMs was from OpenAI in TensorFlow in 2019.

Today, there are already a few active repositories for RLHF in PyTorch that grew out of this. The primary repositories are Transformers Reinforcement Learning ( TRL ), TRLX which originated as a fork of TRL, and Reinforcement Learning for Language models ( RL4LMs ).

TRL is designed to fine-tune pretrained LMs in the Hugging Face ecosystem with PPO. TRLX is an expanded fork of TRL built by CarperAI to handle larger models for online and offline training. At the moment, TRLX has an API capable of production-ready RLHF with PPO and Implicit Language Q-Learning (ILQL) at the scales required for LLM deployment (e.g. 33 billion parameters). Future versions of TRLX will allow for language models up to 200B parameters. As such, interfacing with TRLX is optimized for machine learning engineers with experience at this scale.

RL4LMs offers building blocks for fine-tuning and evaluating LLMs with a wide variety of RL algorithms (PPO, NLPO, A2C and TRPO), reward functions, and metrics. Moreover, the library is easily customizable, which allows training of any encoder-decoder or encoder transformer-based LM on any arbitrary user-specified reward function. Notably, it is well-tested and benchmarked on a broad range of tasks in recent work comprising up to 2000 experiments, highlighting several practical insights on data budget comparison (expert demonstrations vs. reward modeling), handling reward hacking, training instabilities, etc. RL4LMs' current plans include distributed training of larger models and new RL algorithms.

Both TRLX and RL4LMs are under heavy further development, so expect more features beyond these soon.

There is a large dataset created by Anthropic available on the Hub.

What’s next for RLHF?

While these techniques are extremely promising and impactful and have caught the attention of the biggest research labs in AI, there are still clear limitations. The models, while better, can still output harmful or factually inaccurate text without any uncertainty. This imperfection represents a long-term challenge and motivation for RLHF – operating in an inherently human problem domain means there will never be a clear final line to cross for the model to be labeled as complete.

When deploying a system using RLHF, gathering the human preference data is quite expensive due to the direct integration of other human workers outside the training loop. RLHF performance is only as good as the quality of its human annotations, which takes on two varieties: human-generated text, such as fine-tuning the initial LM in InstructGPT, and labels of human preferences between model outputs.

Generating well-written human text answering specific prompts is very costly, as it often requires hiring part-time staff (rather than being able to rely on product users or crowdsourcing). Thankfully, the scale of data used in training the reward model for most applications of RLHF (~50k labeled preference samples) is not as expensive. However, it is still a higher cost than academic labs would likely be able to afford. Currently, there only exists one large-scale dataset for RLHF on a general language model (from Anthropic ) and a couple of smaller-scale task-specific datasets (such as summarization data from OpenAI ). The second challenge of data for RLHF is that human annotators can often disagree, adding a substantial potential variance to the training data without ground truth.
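Concretely, those preference labels are usually turned into a reward-model training signal with a pairwise, Bradley-Terry style loss. A minimal sketch in plain Python (the function name and scalar inputs are ours, standing in for reward-model scores on the two summaries of a comparison):

```python
import math

def pairwise_preference_loss(score_preferred, score_rejected):
    """Loss for one human comparison: -log(sigmoid(r_preferred - r_rejected)).
    Minimizing it pushes the reward model to score the human-preferred
    output higher than the rejected one."""
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A correct, confident ranking incurs a smaller loss than a tie.
assert pairwise_preference_loss(2.0, 0.0) < pairwise_preference_loss(0.0, 0.0)
```

In practice the scores come from a learned network and the loss is averaged over a batch of comparisons, but the shape of the objective is just this.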

With these limitations, huge swaths of unexplored design options could still enable RLHF to take substantial strides. Many of these fall within the domain of improving the RL optimizer. PPO is a relatively old algorithm, but there are no structural reasons that other algorithms could not offer benefits and permutations on the existing RLHF workflow. One large cost of the feedback portion of fine-tuning the LM policy is that every generated piece of text from the policy needs to be evaluated on the reward model (as it acts like part of the environment in the standard RL framework). To avoid these costly forward passes of a large model, offline RL could be used as a policy optimizer. Recently, new algorithms have emerged, such as implicit language Q-learning (ILQL) [ Talk on ILQL at CarperAI], that fit particularly well with this type of optimization. Other core trade-offs in the RL process, like exploration-exploitation balance, have also not been documented. Exploring these directions would at least develop a substantial understanding of how RLHF functions and, if not, provide improved performance.
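As a reminder of what the RL optimizer actually maximizes, the per-sample reward is typically the reward-model score minus a KL penalty that keeps the policy from drifting too far from the initial LM. A toy sketch (the names and the scalar log-probability KL estimate are ours; real implementations apply this per token over a batch):

```python
def shaped_reward(rm_score, logprob_policy, logprob_ref, kl_coef=0.1):
    """Reward fed to the RL optimizer: reward-model score minus a KL
    penalty, estimated here from the log-probabilities the policy and
    the frozen reference model assign to the same sampled text."""
    kl_estimate = logprob_policy - logprob_ref
    return rm_score - kl_coef * kl_estimate

# Staying close to the reference model is not penalized; drifting
# (assigning the sample higher probability than the reference) is.
assert shaped_reward(1.0, -2.0, -2.0) > shaped_reward(1.0, -1.0, -2.0)
```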

We hosted a lecture on Tuesday 13 December 2022 that expanded on this post; you can watch it here !

Further reading

Here is a list of the most prevalent papers on RLHF to date. The field was recently popularized with the emergence of deep RL (around 2017) and has grown into a broader study of the applications of LLMs by many large technology companies. Here are some papers on RLHF that pre-date the LM focus:

  • TAMER: Training an Agent Manually via Evaluative Reinforcement (Knox and Stone 2008): Proposed a learned agent where humans provided scores on the actions taken iteratively to learn a reward model.
  • Interactive Learning from Policy-Dependent Human Feedback (MacGlashan et al. 2017): Proposed an actor-critic algorithm, COACH, where human feedback (both positive and negative) is used to tune the advantage function.
  • Deep Reinforcement Learning from Human Preferences (Christiano et al. 2017): RLHF applied on preferences between Atari trajectories.
  • Deep TAMER: Interactive Agent Shaping in High-Dimensional State Spaces (Warnell et al. 2018): Extends the TAMER framework where a deep neural network is used to model the reward prediction.
  • A Survey of Preference-based Reinforcement Learning Methods (Wirth et al. 2017): Summarizes efforts above with many, many more references.

And here is a snapshot of the growing set of "key" papers that show RLHF's performance for LMs:

  • Fine-Tuning Language Models from Human Preferences (Ziegler et al. 2019): An early paper that studies the impact of reward learning on four specific tasks.
  • Learning to summarize from human feedback (Stiennon et al. 2020): RLHF applied to the task of summarizing text. Also, Recursively Summarizing Books with Human Feedback (OpenAI Alignment Team 2021), follow-on work summarizing books.
  • WebGPT: Browser-assisted question-answering with human feedback (OpenAI, 2021): Using RLHF to train an agent to navigate the web.
  • InstructGPT: Training language models to follow instructions with human feedback (OpenAI Alignment Team 2022): RLHF applied to a general language model [ Blog post on InstructGPT].
  • GopherCite: Teaching language models to support answers with verified quotes (Menick et al. 2022): Train a LM with RLHF to return answers with specific citations.
  • Sparrow: Improving alignment of dialogue agents via targeted human judgements (Glaese et al. 2022): Fine-tuning a dialogue agent with RLHF.
  • ChatGPT: Optimizing Language Models for Dialogue (OpenAI 2022): Training a LM with RLHF for suitable use as an all-purpose chat bot.
  • Scaling Laws for Reward Model Overoptimization (Gao et al. 2022): studies the scaling properties of the learned preference model in RLHF.
  • Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (Anthropic, 2022): A detailed documentation of training a LM assistant with RLHF.
  • Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned (Ganguli et al. 2022): A detailed documentation of efforts to “discover, measure, and attempt to reduce [language models] potentially harmful outputs.”
  • Dynamic Planning in Open-Ended Dialogue using Reinforcement Learning (Cohen et al. 2022): Using RL to enhance the conversational skill of an open-ended dialogue agent.
  • Is Reinforcement Learning (Not) for Natural Language Processing?: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization (Ramamurthy and Ammanabrolu et al. 2022): Discusses the design space of open-source tools in RLHF and proposes a new algorithm NLPO (Natural Language Policy Optimization) as an alternative to PPO.
  • Llama 2 (Touvron et al. 2023): Impactful open-access model with substantial RLHF details.

The field is the convergence of multiple fields, so you can also find resources in other areas:

  • Continual learning of instructions ( Kojima et al. 2021 , Suhr and Artzi 2022 ) or bandit learning from user feedback ( Sokolov et al. 2016 , Gao et al. 2022 )
  • Earlier history on using other RL algorithms for text generation (not all with human preferences), such as with recurrent neural networks ( Ranzato et al. 2015 ), an actor-critic algorithm for text prediction ( Bahdanau et al. 2016 ), or an early work adding human preferences to this framework ( Nguyen et al. 2017 ).

Citation: If you found this useful for your academic work, please consider citing our work, in text:

BibTeX citation:

Thanks to Robert Kirk for fixing some factual errors regarding specific implementations of RLHF. Thanks to Stas Bekman for fixing some typos and confusing phrases. Thanks to Peter Stone, Khanh X. Nguyen, and Yoav Artzi for helping expand the related works further into history. Thanks to Igor Kotenkov for pointing out a technical error in the KL-penalty term of the RLHF procedure, its diagram, and textual description.



Code for "Learning to summarize from human feedback"

openai/summarize-from-feedback


Status: Archive (code is provided as-is, no updates expected)

Learning to Summarize from Human Feedback

This repository contains code to run our models, including the supervised baseline, the trained reward model, and the RL fine-tuned policy.

Supported platform: Python 3.7 64-bit on Ubuntu 18.04

Install pipenv.

Clone this repo. Then, inside it:
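The setup steps here likely amounted to something like the following (a sketch, assuming the repository ships a Pipfile; the repo's own instructions take precedence):

```shell
# Install dependencies into a pipenv-managed virtual environment
# (run from the repository root, where the Pipfile lives)
pipenv install

# Spawn a shell inside that environment to run the scripts below
pipenv shell
```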

Run the models

You'll need to run this on a machine with an Nvidia GPU.

First, let's run some tests to make sure everything is working.

Now let's run some actual evaluations. We can have the model summarize some posts from the validation set:

This will output to /tmp/jobs/sample-ppo-xl/results/.

Now we can evaluate them using the reward model:

This will print some aggregate statistics and output scores for each sample to /tmp/jobs/eval-rm4/results/.

Human feedback data

We've released our human feedback dataset for further research. The dataset contains 64,832 summary comparisons on the TL;DR dataset, as well as our evaluation data on both TL;DR (comparisons and Likert scores) and CNN/DM (Likert scores).

The dataset is stored in Azure Blob Storage, split into the two directories described below: comparisons and axis_evals. You can download it by running azcopy copy "https://openaipublic.blob.core.windows.net/summarize-from-feedback/dataset/*" . --recursive (the . is the local destination directory).

You can also explore the data by hand on our dataset website .

Comparisons

https://openaipublic.blob.core.windows.net/summarize-from-feedback/dataset/comparisons contains labeled comparisons between pairs of summaries as jsonl files, where each line represents a single comparison. Here is a formatted example:

The note field contains the naive interpretation notes written by the worker before seeing the post (but possibly edited afterwards); it may be null.

The split field will always be train, valid1, or valid2; posts/articles marked valid1 were used to select models during training, so we restricted final evaluations to valid2 labels.

The training data for sup4 is found in comparisons/batch3.json through comparisons/batch10.json; later batches are primarily evaluation.
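Putting the fields above together, selecting final-evaluation labels from a downloaded comparisons file could look like this (a sketch; the helper names are ours, and the field names follow the description above):

```python
import json

def load_comparisons(path):
    """Each line of a comparisons file is one JSON-encoded labeled comparison."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def final_eval_comparisons(comparisons):
    """Keep only valid2 labels: valid1 was used for model selection during
    training, so final evaluations are restricted to valid2."""
    return [c for c in comparisons if c.get("split") == "valid2"]
```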

https://openaipublic.blob.core.windows.net/summarize-from-feedback/dataset/axis_evals contains ratings of summaries along several axes, again as jsonl files. Here is a formatted example:

Reddit TL;DR dataset

Our filtered versions of the TL;DR dataset are available here:

  • https://openaipublic.blob.core.windows.net/summarize-from-feedback/datasets/tldr_3_filtered/train.jsonl
  • https://openaipublic.blob.core.windows.net/summarize-from-feedback/datasets/tldr_3_filtered/valid.jsonl
  • https://openaipublic.blob.core.windows.net/summarize-from-feedback/datasets/tldr_3_filtered/test.jsonl
  • https://openaipublic.blob.core.windows.net/summarize-from-feedback/datasets/tldr_3_filtered/samples.txt

For details on the original TL;DR dataset, see Syed, Völske, Potthast, and Stein (2018). It is licensed under CC BY 4.0.


Learning to summarize from human feedback

NeurIPS 2020 · Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul F. Christiano

As language models become more powerful, training and evaluation are increasingly bottlenecked by the data and metrics used for a particular task. For example, summarization models are often trained to predict human reference summaries and evaluated using ROUGE, but both of these metrics are rough proxies for what we really care about---summary quality. In this work, we show that it is possible to significantly improve summary quality by training a model to optimize for human preferences. We collect a large, high-quality dataset of human comparisons between summaries, train a model to predict the human-preferred summary, and use that model as a reward function to fine-tune a summarization policy using reinforcement learning. We apply our method to a version of the TL;DR dataset of Reddit posts and find that our models significantly outperform both human reference summaries and much larger models fine-tuned with supervised learning alone. Our models also transfer to CNN/DM news articles, producing summaries nearly as good as the human reference without any news-specific fine-tuning. We conduct extensive analyses to understand our human feedback dataset and fine-tuned models. We establish that our reward model generalizes to new datasets, and that optimizing our reward model results in better summaries than optimizing ROUGE according to humans. We hope the evidence from our paper motivates machine learning researchers to pay closer attention to how their training loss affects the model behavior they actually want.


AI Alignment Forum

"Summarizing Books with Human Feedback" (recursive GPT-3)

I'd heard about this before, but not the alignment spin on it. This is more interesting to me from a capabilities standpoint than an alignment standpoint, so I had assumed that this was motivated by the normal incentives for capabilities research. I'd be interested if I'm in fact wrong, or if it seems more alignment-y to other people.

From an alignment perspective the main point is that the required human data does not scale with the length of the book (or maybe scales logarithmically). In general we want evaluation procedures that scale gracefully, so that we can continue to apply them even for tasks where humans can't afford to produce or evaluate any training examples.

The approach in this paper will produce worse summaries than fine-tuning a model end-to-end. In order to produce good summaries, you will ultimately need to use more sophisticated decompositions; for example, if a character appears in the 2nd chapter of a book who is not mentioned in the first chapter's summary, you want to ask "what is this character's deal?" and have it answered by a model that has read the first chapter. And in the long run you need to handle a lot of issues that don't decompose so cleanly.

If you don't have more sophisticated decompositions, and you go past the scale where humans can provide oversight directly, then you will be forced to use proxies that are available end-to-end, resulting in traditional safety problems. For example, you have to evaluate "did this plan result in high reward when we ran it?" rather than "does this plan look good before we run it?" because the plan is too complex and vast for a human to understand.

So I think this is a good problem and first baseline for studying recursive decompositions and trying to have them be competitive with end-to-end supervision. You could also hope to learn something about e.g. how errors propagate in recursive supervision schemes, and generally what happens when you use ML to approximate large collaborations of humans on tasks that are too large for humans to handle them directly.

This in turn seems like an important problem to start working on given that most concrete plans for overseeing powerful models appear to require this kind of recursive oversight (and most safety concerns can be cashed out as limitations with that kind of recursive oversight). So in general I think you should be excited about people pushing these methods as far as they can.

I had assumed that this was motivated by the normal incentives for capabilities research. I'd be interested if I'm in fact wrong, or if it seems more alignment-y to other people.

It was primarily motivated as a first experiment in doing recursive oversight. In retrospect I think we should probably have gone straight for more complicated decompositions rather than starting with the simple baseline.

The fact that they apply a model with a 2k-token context to books is more about capabilities than alignment, although it still raises many of the same questions about e.g. cascading errors and how to deal with decomposition in practice. I think from an alignment perspective you'd want to work on overseeing the longest context that the model can handle (if you want decompositions to be interesting that likely means working with longer contexts than GPT-3; this is a general instance of the phenomenon that it's hard to practice overseeing superhuman systems while all of our systems are subhuman on most axes).

Ah, yeah. I guess this connection makes perfect sense if we're imagining supervising black-box-esque AIs that are passing around natural language plans.

Although that supervision problem is more like... summarizing Alice in Wonderland if all the pages had gotten ripped out and put back in random order. Or something. But sure, baby steps.
