Appendix · Safety & alignment
Overview#
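Safety and alignment work spans four areas: training-time alignment (preference learning), evaluation, interpretability, and deployment-time controls. The main practical failure modes are specification gaps (the training objective does not capture what is actually wanted), distribution shift (deployment inputs differ from the training data), and misuse (deliberate use for harm).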
1) Preference learning and instruction tuning#
Key papers#
- InstructGPT (Ouyang et al., 2022)
- Constitutional AI (Bai et al., 2022)
- Direct Preference Optimization (Rafailov et al., 2023)
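
To make the last entry concrete, here is a minimal sketch of the per-example DPO objective as described by Rafailov et al. (2023): the negative log-sigmoid of a scaled difference between the policy-to-reference log-ratios of the preferred and dispreferred responses. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions rather than any library's API, and the inputs are assumed to be summed token log-probabilities for each full response.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss (illustrative sketch, not a library API).

    Each *_logp argument is assumed to be the summed log-probability of a
    full response under the trainable policy or the frozen reference model.
    """
    # How much more (or less) likely each response is under the policy
    # than under the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled preference margin; beta controls how strongly the policy is
    # pushed away from the reference.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably
    # for both signs of the margin.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference, the margin is zero and the loss is log 2; it falls as the policy raises the preferred response's likelihood relative to the dispreferred one. A real training loop would average this quantity over a batch and backpropagate through the policy log-probabilities, which this scalar sketch leaves out.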