This short tutorial explains the training objectives used to develop ChatGPT, the new chatbot language model from OpenAI.
Timestamps:
0:00 – Non-intro
0:24 – Training overview
1:33 – Generative pretraining (the raw language model)
4:18 – The alignment problem
6:26 – Supervised fine-tuning
7:19 – Limitations of supervision: distributional shift
8:50 – Reward learning based on preferences
10:39 – Reinforcement learning from human feedback
13:02 – Room for improvement
ChatGPT: https://openai.com/blog/chatgpt
Relevant papers for learning more:
InstructGPT: Ouyang et al., 2022 – https://arxiv.org/abs/2203.02155
GPT-3: Brown et al., 2020 – https://arxiv.org/abs/2005.14165
PaLM: Chowdhery et al., 2022 – https://arxiv.org/abs/2204.02311
Efficient reductions for imitation learning: Ross & Bagnell, 2010 – https://proceedings.mlr.press/v9/ross10a.html
Deep reinforcement learning from human preferences: Christiano et al., 2017 – https://arxiv.org/abs/1706.03741
Learning to summarize from human feedback: Stiennon et al., 2020 – https://arxiv.org/abs/2009.01325
Scaling laws for reward model overoptimization: Gao et al., 2022 – https://arxiv.org/abs/2210.10760
Proximal policy optimization algorithms: Schulman et al., 2017 – https://arxiv.org/abs/1707.06347
Special thanks to Elmira Amirloo for feedback on this video.
Links:
YouTube: https://www.youtube.com/ariseffai
Twitter: https://twitter.com/ari_seff
Homepage: https://www.ariseff.com
If you’d like to help support the channel (completely optional), you can chip in for a cup of coffee via the following:
Venmo: https://venmo.com/ariseff
PayPal: https://www.paypal.me/ariseff