Description: MIT's introductory course on AI safety, focusing on empirical ML techniques that help mitigate catastrophic risks from AI. Topics include reinforcement learning (including learning from human feedback), jailbreaking large language models, transformer circuits, superposition in neural networks, and detecting deception in ML models. The course covers both foundational and cutting-edge results from this emerging field. The class will have two labs, in which instructors guide students through implementations of techniques taught in the lectures.
Prerequisites: 6.3900 (6.036) or equivalent.
Instructors: Eric Gan, Eleni Shor, Kaivu Hariharan
- Dates: Weeks of 1-15-24 and 1-22-24
- Classes: Monday, Tuesday, Wednesday from 3 - 4:30 PM
- Labs: Thursday from 2 - 5 PM
- Room: 36-112 (both lectures and labs)
- Google Calendar Link
Date | Time | Topic | Material |
---|---|---|---|
Mon 1-15 | 3 - 4:30 PM | Class 1: Introduction | • Guest lecture, Professor Dylan Hadfield-Menell (Zoom) |
Tue 1-16 | 3 - 4:30 PM | Class 2: Reinforcement Learning | • POMDPs • Policy gradients • Reward specification • Goal generalization |
Wed 1-17 | 3 - 4:30 PM | Class 3: ChatGPT Alignment | • Reward models • RLHF • Jailbreaking language models |
Thu 1-18 | 2 - 5 PM | Lab 1 | • PyTorch basics • RLHF • Jailbreaking |
Mon 1-22 | 3 - 4:30 PM | Class 4: Transformers | • Transformer architecture • Induction heads • Transformer circuits |
Tue 1-23 | 3 - 4:30 PM | Class 5: Superposition | • Feature visualization • The superposition hypothesis • Sparse autoencoders |
Wed 1-24 | 3 - 4:30 PM | Class 6: Scalable Alignment | • Scaling laws and emergence • Model evaluations • Detecting deception |
Thu 1-25 | 2 - 5 PM | Lab 2 | • Build a transformer • Sparse autoencoders • Modular addition |
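For a flavor of what Lab 2 covers, below is a minimal sparse-autoencoder sketch in PyTorch. The layer sizes, L1 coefficient, and class and function names are illustrative assumptions, not the lab's actual code.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy sparse autoencoder of the kind used to decompose model activations
    into (hopefully) interpretable features. Dimensions are illustrative."""

    def __init__(self, d_model: int = 512, d_hidden: int = 2048):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))   # sparse feature activations
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(x, reconstruction, features, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty encouraging sparse features.
    mse = torch.mean((reconstruction - x) ** 2)
    sparsity = l1_coeff * features.abs().mean()
    return mse + sparsity

if __name__ == "__main__":
    sae = SparseAutoencoder()
    activations = torch.randn(8, 512)             # stand-in for model activations
    recon, feats = sae(activations)
    print(sae_loss(activations, recon, feats))
```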