Approaches: Chain-of-Thought (CoT), Self-Consistency (SC), Tree of Thoughts (ToT), AutoGen, AlphaCodium
DoDs (definitions of done):
- improve accuracy of plan execution
- break down the task into smaller steps
- evaluate/reflect on each step
- [optionally] be able to cancel/pause execution
- more transparency/observability for execution
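The DoDs above can be sketched as a small step runner. This is only an illustration under assumptions: `PlanRunner`, `StepResult`, and the `evaluate` callback are hypothetical names, not part of any of the listed approaches. Each step is a plain callable, `evaluate` stands in for whatever reflection check is chosen, and cancellation is modeled with a `threading.Event` that is checked between steps.

```python
import threading
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class StepResult:
    """One entry in the execution log (the observability record)."""
    name: str
    output: str
    ok: bool


@dataclass
class PlanRunner:
    # Hypothetical reflection hook: returns True if the step output is acceptable.
    evaluate: Callable[[str, str], bool]
    # Set this event from another thread (or before run) to stop execution.
    cancel: threading.Event = field(default_factory=threading.Event)
    log: List[StepResult] = field(default_factory=list)

    def run(self, steps: List[Tuple[str, Callable[[], str]]]) -> List[StepResult]:
        for name, action in steps:
            if self.cancel.is_set():
                break  # cancelled: stop before starting the next step
            output = action()
            ok = self.evaluate(name, output)
            self.log.append(StepResult(name, output, ok))
            if not ok:
                break  # a failed check halts the plan; the log shows where
        return self.log


# Usage: three small steps, the second one produces an empty (rejected) output.
steps = [
    ("parse", lambda: "parsed"),
    ("transform", lambda: ""),
    ("write", lambda: "written"),
]
runner = PlanRunner(evaluate=lambda name, output: bool(output))
results = runner.run(steps)
```

Stopping on the first failed check is one possible policy; retrying the step or re-planning would fit the same loop. Pausing could be handled with a second event in the same way as `cancel`.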
How to evaluate?
- find existing dataset/eval
- create our own dataset/eval (10-100 examples, 1-2 files each)
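If we build our own dataset, a minimal harness could look like the sketch below. The names `EvalExample`, `run_eval`, and `solve` are assumptions made for illustration; each example carries a task description, its 1-2 input files, and a hypothetical per-example `check` function that decides whether the produced answer counts as correct.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalExample:
    task: str                      # natural-language task description
    files: Dict[str, str]          # filename -> contents (1-2 files per example)
    check: Callable[[str], bool]   # hypothetical correctness check for the answer


def run_eval(
    examples: List[EvalExample],
    solve: Callable[[str, Dict[str, str]], str],
) -> float:
    """Return the fraction of examples whose answer passes its check."""
    if not examples:
        return 0.0
    passed = sum(1 for ex in examples if ex.check(solve(ex.task, ex.files)))
    return passed / len(examples)


# Usage with a toy "solver" that just upper-cases the first file.
examples = [
    EvalExample("uppercase the file", {"a.txt": "hello"}, lambda out: out == "HELLO"),
    EvalExample("reverse the file", {"b.txt": "abc"}, lambda out: out == "cba"),
]


def toy_solve(task: str, files: Dict[str, str]) -> str:
    return next(iter(files.values())).upper()


accuracy = run_eval(examples, toy_solve)
```

Running the same examples against each approach gives a comparable accuracy number per approach; with only 10-100 examples the differences should be read as rough signals.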