Experiment with different agent approaches

Approaches: Chain-of-Thought (CoT), Self-Consistency (SC), Tree of Thoughts (ToT), AutoGen, AlphaCodium
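
As a point of reference for the SC entry, self-consistency can be sketched as sampling several answers and taking a majority vote. This is a minimal illustration only: `sample` is a hypothetical callable standing in for one model call, and `self_consistent_answer` is a name made up for this note.

```python
from collections import Counter
from typing import Callable


def self_consistent_answer(sample: Callable[[str], str], prompt: str, n: int = 5) -> str:
    """Sample n answers for the same prompt and return the most frequent one.

    `sample` is a hypothetical stand-in for a single (non-deterministic) model call.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    answers = [sample(prompt) for _ in range(n)]
    # most_common(1) returns a list with one (answer, count) pair
    return Counter(answers).most_common(1)[0][0]
```

In practice the answers would probably need normalizing (whitespace, casing, extracting the final answer from the reasoning) before voting.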

Definition of Done (DoD):
* improve accuracy of plan execution 
* break down the task into smaller steps
* evaluate/reflect on each step
* [optionally] be able to cancel/pause execution
* more transparency/observability for execution
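
One possible shape for the items above, as a rough sketch rather than a proposed design: a runner that executes a plan step by step, reflects on each step, can be cancelled between steps, and keeps a log for observability. All names here (`PlanRunner`, `StepRecord`, `execute`, `reflect`) are hypothetical; `execute` and `reflect` stand in for whatever agent calls end up being used.

```python
import threading
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class StepRecord:
    step: str
    output: str
    accepted: bool


@dataclass
class PlanRunner:
    execute: Callable[[str], str]        # hypothetical: runs one step, returns its output
    reflect: Callable[[str, str], bool]  # hypothetical: True if the output looks acceptable
    max_retries: int = 1
    cancel_event: threading.Event = field(default_factory=threading.Event)
    log: List[StepRecord] = field(default_factory=list)

    def run(self, steps: List[str]) -> bool:
        """Run steps in order; return True only if every step was accepted."""
        for step in steps:
            # Cancellation is checked between steps, not in the middle of one.
            if self.cancel_event.is_set():
                return False
            accepted = False
            for _ in range(self.max_retries + 1):
                output = self.execute(step)
                accepted = self.reflect(step, output)
                # Every attempt is logged, including rejected ones.
                self.log.append(StepRecord(step, output, accepted))
                if accepted:
                    break
            if not accepted:
                return False
        return True
```

Calling `runner.cancel_event.set()` from another thread stops the run before the next step starts; `runner.log` holds one record per attempt and can be inspected afterwards.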

How to evaluate?
* find existing dataset/eval
* create our own dataset/eval (10-100 examples, 1-2 files each)
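
If we go with our own dataset, the eval loop could start as small as the sketch below. It assumes exact-match scoring and an `Example` layout (task text, a few files, expected output) invented for this note; the real format and metric are still open.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Example:
    task: str
    expected: str
    files: Dict[str, str] = field(default_factory=dict)  # filename -> content, 1-2 per example


def accuracy(agent: Callable[[Example], str], examples: List[Example]) -> float:
    """Fraction of examples where the agent output matches `expected` exactly
    (ignoring leading/trailing whitespace)."""
    if not examples:
        raise ValueError("examples must not be empty")
    correct = sum(agent(ex).strip() == ex.expected.strip() for ex in examples)
    return correct / len(examples)
```

Exact match is the simplest metric and likely too strict for code-editing tasks; running tests against the produced files would be a natural next step.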