diff --git a/experiments/buddy-benchmarks/README.md b/experiments/buddy-benchmarks/README.md new file mode 100644 index 0000000..038b974 --- /dev/null +++ b/experiments/buddy-benchmarks/README.md @@ -0,0 +1,149 @@ +# Buddy-MLIR Gemmini Performance Benchmarks + +Performance evaluation of [Buddy-MLIR](https://github.com/buddy-compiler/buddy-mlir)'s Gemmini dialect backend, +benchmarked against the Gemmini C reference implementation on Spike simulator. + +For the full lowering pipeline and setup instructions, see [WORKFLOW.md](WORKFLOW.md). + +## Performance Results + +### Matmul Workloads + +| Workload | Dataflow | Gemmini C cycles | Buddy cycles | Checksum Match | Speedup | +|----------|----------|------------------|--------------|----------------|---------| +| MLP2 (64x832) | WS | 2,528 | 409 | ✓ 252338 | 6.18x | +| MLP2 (64x832) | OS | 207,782 | 96,076 | ✓ 252338 | 2.16x | +| MLP1 (6-layer) | WS | 25,251 | 2,539 | ✓ 258664 | 9.95x | +| softmax matmul (31x30x66) | WS | 335 | 145 | ✓ 3860 | 2.31x | +| IGELU matmul (30x30x30) | WS | 133 | 133 | ✓ -23260 | 1.00x | + +### Conv Workloads + +After the conv-encoding fix ([buddy-compiler/buddy-mlir#689](https://github.com/buddy-compiler/buddy-mlir/pull/689)): + +| Workload | CPU cycles | Gemmini C cycles | Buddy cycles | Checksum Match | Buddy vs Gemmini C | +|----------|-----------|------------------|--------------|----------------|---------------------| +| conv (17x17, k=3, stride=2) | 7,559,913 | 1,027 | 149 | ✓ 950 | 6.89x | +| conv_with_pool (17x17, k=3, pool=3) | 7,714,291 | 1,605 | 172 | ✓ 30827 | 9.33x | + +### ResNet50 Layer Validation + +| Layer | Gemmini C cycles | Buddy cycles | Checksum Match | Speedup | +|-------|------------------|--------------|----------------|---------| +| Conv1 (7x7, stride=2, pool) | 225,146 | 7,313 | ✓ 10206332 | 30.8x | + +## Methodology + +- **Simulator**: Spike ISA simulator with Gemmini extension (`dim=16`) +- **Cycle measurement**: `rdcycle` instruction around the accelerator call 
(between `gemmini_flush(0)` and `gemmini_fence()`) +- **Validation**: Output checksums compared between Buddy-MLIR and Gemmini C reference +- **Gemmini C reference**: `tiled_matmul_auto` / `tiled_conv_auto` from + [gemmini-rocc-tests](https://github.com/ucb-bar/gemmini-rocc-tests) + +### Important Caveats + +On Spike, the `rdcycle` counter effectively measures **CPU instructions executed** (Spike retires one instruction per +cycle), not wall-clock time or Gemmini hardware execution time. Buddy-MLIR's compile-time loop unrolling reduces +host-side loop overhead (fewer `rdcycle` ticks for loop control), making the cycle +counts lower even when the underlying Gemmini hardware work is identical. + +The speedup numbers reflect reduced host-side orchestration overhead, not necessarily +faster accelerator throughput. + +### Buddy-MLIR Conv Encoding Fix + +The conv benchmarks require the fix from [buddy-compiler/buddy-mlir#689](https://github.com/buddy-compiler/buddy-mlir/pull/689), +which corrects the `im2col` encoding for convolutions in the Gemmini lowering path. +Without this fix, conv outputs produce incorrect checksums.
+ +## Directory Structure + +``` +experiments/buddy-benchmarks/ +├── README.md # This file +├── scripts/ +│ └── run_benchmark.sh # Run all benchmarks on Spike +├── kernels/ +│ ├── Makefile # Build all kernel benchmarks +│ ├── conv/ +│ │ ├── conv-buddy.mlir # 17x17 conv, k=3, stride=2 +│ │ └── conv-buddy.c # C harness +│ ├── conv-with-pool/ +│ │ ├── conv-with-pool-buddy.mlir # Conv + 3x3 maxpool +│ │ └── conv-with-pool-buddy.c +│ ├── mlp2/ +│ │ ├── mlp2-buddy.mlir # 2-layer MLP (WS) +│ │ ├── mlp2-buddy-os.mlir # 2-layer MLP (OS) +│ │ └── mlp2-buddy.c +│ ├── mlp1/ +│ │ ├── mlp1-buddy.mlir # 6-layer MLP +│ │ └── mlp1-buddy.c +│ ├── softmax-matmul/ +│ │ ├── softmax-matmul-buddy.mlir +│ │ └── softmax-matmul-buddy.c +│ └── igelu-matmul/ +│ ├── igelu-matmul-buddy.mlir +│ └── igelu-matmul-buddy.c +├── resnet50/ +│ ├── Makefile # Build + validate ResNet50 conv1 +│ ├── conv1-buddy.mlir # ResNet50 conv1 (7x7, stride=2, pool) +│ ├── conv1-buddy.c # Buddy C harness +│ ├── conv1-gemmini.c # Gemmini C reference +│ ├── conv1-bad-buddy.mlir # Intentional bad case (wrong stride) +│ └── conv1-bad-buddy.c +└── logs/ # Reference Spike output logs + ├── conv1-gemmini.log + ├── conv1-buddy.log + └── conv1-bad-buddy.log +``` + +## How to Reproduce + +### Prerequisites + +- RISC-V GNU toolchain (GCC cross-compiler for `riscv64-unknown-elf`) +- [Buddy-MLIR](https://github.com/buddy-compiler/buddy-mlir) built with Gemmini dialect + (`buddy-opt`, `buddy-translate`, `buddy-llc`) +- [Spike](https://github.com/riscv-software-src/riscv-isa-sim) ISA simulator with Gemmini extension +- [gemmini-rocc-tests](https://github.com/ucb-bar/gemmini-rocc-tests) (for headers and baremetal runtime) + +### Build and Run Kernel Benchmarks + +```bash +cd experiments/buddy-benchmarks/kernels + +# Set paths (adjust to your environment) +export RISCV=/path/to/riscv-toolchain +export BUDDY=/path/to/buddy-mlir/build/bin + +# Build all benchmarks +make all + +# Run all on Spike +make run-all + +# Or run individual 
benchmarks +make run-conv +make run-mlp2 +make run-mlp1 +``` + +### Build and Run ResNet50 Validation + +```bash +cd experiments/buddy-benchmarks/resnet50 + +# Build all (Gemmini C reference + Buddy + intentional bad case) +make all + +# Run full validation suite (compares checksums automatically) +make validate +``` + +### Run Everything + +```bash +cd experiments/buddy-benchmarks +./scripts/run_benchmark.sh +``` diff --git a/experiments/buddy-benchmarks/WORKFLOW.md b/experiments/buddy-benchmarks/WORKFLOW.md new file mode 100644 index 0000000..49a5b87 --- /dev/null +++ b/experiments/buddy-benchmarks/WORKFLOW.md @@ -0,0 +1,356 @@ +# Buddy-MLIR Gemmini Workflow: From MLIR to Execution on Spike + +This document describes the complete pipeline for compiling Gemmini dialect MLIR +to bare-metal RISC-V and running it on the Spike ISA simulator with the Gemmini +accelerator extension. + +## Pipeline Overview + +``` + ┌──────────────────────┐ + │ Gemmini MLIR Source │ + │ (gemmini.tile_*) │ + └──────────┬───────────┘ + │ + buddy-opt --lower-gemmini + + standard MLIR passes + │ + ▼ + ┌──────────────────────┐ + │ LLVM Dialect MLIR │ + │ (.llvm.mlir) │ + └──────────┬───────────┘ + │ + buddy-translate --buddy-to-llvmir + │ + ▼ + ┌──────────────────────┐ + │ LLVM IR (.ll) │ + └──────────┬───────────┘ + │ + buddy-llc -mattr=+buddyext + -mtriple=riscv64-unknown-elf + │ + ▼ + ┌──────────────────────┐ + │ RISC-V Object (.o) │ + │ (RoCC custom insns) │ + └──────────┬───────────┘ + │ + riscv64-unknown-elf-gcc + link with C harness + + baremetal runtime + │ + ▼ + ┌──────────────────────┐ + │ Bare-metal ELF │ + └──────────┬───────────┘ + │ + spike --extension=gemmini + │ + ▼ + ┌──────────────────────┐ + │ Gemmini Simulator │ + │ Output + Cycles │ + └──────────────────────┘ +``` + +## Prerequisites + +### 1. RISC-V GNU Toolchain + +A bare-metal cross-compiler targeting `riscv64-unknown-elf`: + +```bash +# Provides: riscv64-unknown-elf-gcc, as, ld, objdump, etc. 
+export RISCV=/path/to/riscv-toolchain +``` + +Build from source: https://github.com/riscv-collab/riscv-gnu-toolchain +```bash +./configure --prefix=$RISCV --with-arch=rv64gc --with-abi=lp64d +make +``` + +### 2. Spike ISA Simulator (with Gemmini extension) + +Spike must be built with Gemmini support from the Chipyard repository: + +```bash +# Clone chipyard (includes Gemmini as a generator) +git clone https://github.com/ucb-bar/chipyard.git +cd chipyard && ./scripts/init-submodules-no-riscv-tools.sh + +# Build Spike with Gemmini extension +cd sims/spike +make + +# Or use a pre-built spike if available: +export SPIKE=$RISCV/bin/spike +``` + +### 3. Gemmini ROCC Tests (headers + baremetal runtime) + +The C harnesses depend on headers and the bare-metal runtime from +[gemmini-rocc-tests](https://github.com/ucb-bar/gemmini-rocc-tests): + +```bash +export GEMMINI_ROOT=/path/to/chipyard/generators/gemmini/software/gemmini-rocc-tests +``` + +Key files used: +- `include/gemmini.h` — Gemmini C API (`tiled_matmul_auto`, `tiled_conv_auto`, RoCC instruction macros) +- `include/gemmini_params.h` — Hardware parameters (DIM=16, scratchpad/accumulator sizes) +- `include/gemmini_testutils.h` — Test utilities (`read_cycles`, checksum helpers) +- `include/gemmini_nn.h` — NN layer helpers (for ResNet50 reference) +- `riscv-tests/benchmarks/common/` — Bare-metal startup code (`_start`, printf shims, syscall stubs) +- `riscv-tests/benchmarks/common/test.ld` — Linker script for bare-metal execution + +### 4. 
Buddy-MLIR (with Gemmini dialect) + +Build Buddy-MLIR from source with Gemmini dialect support: + +```bash +# Step 1: Build LLVM/MLIR +git clone https://github.com/buddy-compiler/buddy-mlir.git +cd buddy-mlir && git submodule update --init +mkdir llvm/build && cd llvm/build +cmake -G Ninja ../llvm \ + -DLLVM_ENABLE_PROJECTS="mlir" \ + -DLLVM_TARGETS_TO_BUILD="host;RISCV" \ + -DCMAKE_BUILD_TYPE=Release +ninja + +# Step 2: Build buddy-mlir +cd ../../ +mkdir build && cd build +cmake -G Ninja .. \ + -DMLIR_DIR=$PWD/../llvm/build/lib/cmake/mlir \ + -DLLVM_DIR=$PWD/../llvm/build/lib/cmake/llvm \ + -DCMAKE_BUILD_TYPE=Release +ninja buddy-opt buddy-translate buddy-llc + +export BUDDY=$PWD/bin +``` + +**Important:** Conv benchmarks require the fix from +[buddy-compiler/buddy-mlir#689](https://github.com/buddy-compiler/buddy-mlir/pull/689) +which corrects the `im2col` encoding in the Gemmini conv lowering. + +## Step-by-Step: Compiling a Gemmini MLIR Kernel + +Using `conv-buddy.mlir` as an example: + +### Step 1: Write the Gemmini dialect MLIR + +```mlir +// conv-buddy.mlir +module { + func.func @conv(%input: memref<2x17x17x18xi8>, + %weights: memref<162x19xi8>, + %bias: memref<19xi32>, + %output: memref<162x19xi8>) attributes { llvm.emit_c_interface } { + %c9 = arith.constant 9 : i64 + %c3 = arith.constant 3 : i64 + gemmini.tile_conv %input %weights %bias %output %c9 %c9 %c3 + {stride = 2, inputDilation = 1, kernelDilation = 1, padding = 1, + act = 0} : + memref<2x17x17x18xi8> memref<162x19xi8> memref<19xi32> memref<162x19xi8> + i64 i64 i64 + return + } +} +``` + +The `llvm.emit_c_interface` attribute generates a `_mlir_ciface_conv` wrapper +callable from C with memref descriptor structs. 
+ +Key `gemmini.tile_conv` operands: +- `%input` — 4D input tensor (batch × height × width × channels) +- `%weights` — 2D flattened weight matrix (patch_size × out_channels) +- `%bias` — 1D bias vector +- `%output` — 2D output matrix (n_patches × out_channels) +- `%c9 %c9` — output row/col dimensions (before pooling) +- `%c3` — kernel dimension + +Key attributes: `stride`, `padding`, `act` (0=none, 1=ReLU, 3=iGELU, 4=softmax), +`poolSize`, `poolStride`, `poolPadding`, `dataflow` (0=OS, 1=WS), `bertScale`. + +### Step 2: Lower to LLVM dialect + +```bash +buddy-opt conv-buddy.mlir \ + -lower-gemmini \ + -convert-scf-to-cf \ + -convert-arith-to-llvm \ + -convert-func-to-llvm \ + -llvm-legalize-for-export \ + -o conv-buddy.llvm.mlir +``` + +Pass breakdown: +| Pass | What it does | +|------|-------------| +| `-lower-gemmini` | `gemmini.tile_conv` → Gemmini intrinsics (`gemmini.intr.loop_conv_ws`, `gemmini.intr.config_ex`, `gemmini.intr.flush`, etc.) with pre-computed tile sizes and constant offsets | +| `-convert-scf-to-cf` | SCF control flow → branch-based control flow | +| `-convert-arith-to-llvm` | Arithmetic ops → LLVM dialect | +| `-convert-func-to-llvm` | Function signatures → LLVM calling convention | +| `-llvm-legalize-for-export` | Final cleanup for LLVM IR emission | + +### Step 3: Translate to LLVM IR + +```bash +buddy-translate conv-buddy.llvm.mlir --buddy-to-llvmir -o conv-buddy.ll +``` + +This produces standard LLVM IR with inline assembly for Gemmini's RoCC custom +instructions (encoded as `.insn r` directives for the RISC-V assembler). 
+ +### Step 4: Compile to RISC-V object + +```bash +buddy-llc conv-buddy.ll \ + -O3 \ + -filetype=obj \ + -mtriple=riscv64-unknown-elf \ + -mattr=+buddyext,+d,+f,+c \ + -float-abi=hard \ + -code-model=medium \ + -o conv-buddy.o +``` + +Key flags: +| Flag | Why | +|------|-----| +| `-mattr=+buddyext` | Enables custom Gemmini RoCC instruction support | +| `-mattr=+d,+f,+c` | Double/float/compressed RISC-V extensions | +| `-code-model=medium` | Required for large models (avoids `R_RISCV_HI20` relocation overflow) | +| `-float-abi=hard` | Hardware floating-point ABI | + +### Step 5: Write a C harness + +The C harness provides `main()`, initializes inputs, and calls the MLIR-generated +function through the C interface: + +```c +#include "include/gemmini.h" +#include "include/gemmini_testutils.h" + +// Memref descriptor matching MLIR's C interface +typedef struct { + elem_t *basePtr; + elem_t *data; + int64_t offset; + int64_t sizes[4]; + int64_t strides[4]; +} MemRef4D_i8; + +// The MLIR-generated function (from llvm.emit_c_interface) +extern void _mlir_ciface_conv(MemRef4D_i8 *input, ...); + +int main(void) { + // Initialize inputs, call function, measure cycles with rdcycle + gemmini_flush(0); + uint64_t start = read_cycles(); + _mlir_ciface_conv(&input_ref, &weights_ref, &bias_ref, &output_ref); + gemmini_fence(); + uint64_t end = read_cycles(); + printf("Cycles: %llu\n", (unsigned long long)(end - start)); + // Compute and print output checksum for validation +} +``` + +### Step 6: Link into bare-metal ELF + +```bash +riscv64-unknown-elf-gcc \ + -DPREALLOCATE=1 -DMULTITHREAD=1 -DBAREMETAL=1 \ + -mcmodel=medany -std=gnu99 -O2 -ffast-math \ + -fno-common -fno-builtin-printf \ + -fno-tree-loop-distribute-patterns \ + -march=rv64gc -Wa,-march=rv64gc \ + -nostdlib -nostartfiles -static \ + -T $GEMMINI_ROOT/riscv-tests/benchmarks/common/test.ld \ + -I$GEMMINI_ROOT/riscv-tests -I$GEMMINI_ROOT/riscv-tests/env \ + -I$GEMMINI_ROOT -I$GEMMINI_ROOT/include \ + 
-I$GEMMINI_ROOT/riscv-tests/benchmarks/common \ + conv-buddy.c conv-buddy.o \ + $GEMMINI_ROOT/riscv-tests/benchmarks/common/*.c \ + $GEMMINI_ROOT/riscv-tests/benchmarks/common/*.S \ + -lm -lgcc \ + -o conv-baremetal +``` + +The bare-metal runtime from `benchmarks/common/` provides: +- `_start` entry point and C runtime initialization +- `printf` via HTIF (Host-Target Interface) syscalls +- Memory management stubs + +### Step 7: Run on Spike + +```bash +spike --extension=gemmini conv-baremetal +``` + +Spike simulates the RISC-V core with the Gemmini systolic array extension +(default config: 16×16 PEs, weight-stationary dataflow). The `--extension=gemmini` +flag loads the Gemmini functional model that intercepts RoCC custom instructions. + +Example output: +``` +Buddy conv cycles: 149 +Buddy conv output checksum: 950 +Gemmini extension configured with: + dim = 16 +``` + +## Using the Makefiles + +Instead of running each step manually, use the provided Makefiles: + +```bash +# Kernel benchmarks (conv, mlp, etc.) +cd experiments/buddy-benchmarks/kernels +make conv-baremetal # Build one benchmark +make all # Build all benchmarks +make run-conv # Build + run on Spike +make run-all # Run everything + +# ResNet50 layer validation +cd experiments/buddy-benchmarks/resnet50 +make all # Build Gemmini C ref + Buddy + bad case +make validate # Run all three and compare checksums + +# Or run the full suite: +cd experiments/buddy-benchmarks +./scripts/run_benchmark.sh +``` + +## Why Buddy-MLIR Shows Fewer Cycles + +The `rdcycle` instruction counts **CPU instructions executed**, not Gemmini +hardware cycles. Buddy-MLIR's lowering pre-computes tile sizes, loop bounds, and +memory offsets at compile time, emitting a flat sequence of Gemmini intrinsic calls +with constant arguments. 
In contrast, the Gemmini C reference (`tiled_matmul_auto`) +performs runtime tile-size search, per-tile address arithmetic, and loop iteration — +all of which execute on the CPU and inflate the `rdcycle` count. + +The underlying Gemmini hardware work (systolic array compute, DMA transfers) is +the same in both cases. The speedup reflects reduced **host-side orchestration +overhead**, not faster accelerator throughput. This advantage would still manifest +on real hardware, since the CPU is freed up sooner for other work. + +## Gemmini MLIR Operations Reference + +| Operation | Description | Key Attributes | +|-----------|-------------|----------------| +| `gemmini.tile_matmul` | Tiled matrix multiply | `dataflow` (0=OS, 1=WS), `act` (0/1/3/4) | +| `gemmini.tile_conv` | Tiled convolution (im2col) | `stride`, `padding`, `poolSize`, `poolStride`, `act` | +| `gemmini.intr.flush` | Flush Gemmini command queue | — | +| `gemmini.intr.config_ex` | Configure execution mode | dataflow, activation, scale | +| `gemmini.intr.loop_ws` | Weight-stationary matmul loop | tile dimensions, addresses | +| `gemmini.intr.loop_conv_ws` | Weight-stationary conv loop | conv parameters, addresses | + +Activation functions: 0=none, 1=ReLU, 3=iGELU, 4=softmax + +Dataflows: 0=output-stationary (accumulates in place), 1=weight-stationary (keeps weights in scratchpad) diff --git a/experiments/buddy-benchmarks/kernels/Makefile b/experiments/buddy-benchmarks/kernels/Makefile new file mode 100644 index 0000000..c54f737 --- /dev/null +++ b/experiments/buddy-benchmarks/kernels/Makefile @@ -0,0 +1,225 @@ +# Makefile for Buddy-MLIR Gemmini kernel benchmarks +# +# Builds MLIR kernels through buddy-opt -> buddy-translate -> buddy-llc, +# then links with C harnesses against gemmini-rocc-tests baremetal runtime. 
# +# Targets: +#   all                - Build all kernel benchmarks +#   run-all            - Run all on Spike +#   &lt;bench&gt;-baremetal  - Build a specific benchmark +#   run-&lt;bench&gt;        - Run a specific benchmark on Spike +#   clean              - Remove build artifacts + +# ============== Paths ============== +RISCV ?= /home/eecs/ashvin.verma/toolchains/riscv +BUDDY ?= /scratch/ashvin/buddy-mlir/build/bin +PK ?= /scratch/ashvin/riscv-pk/build/pk +SPIKE ?= $(RISCV)/bin/spike + +GEMMINI_ROOT := /scratch/ashvin/chipyard/generators/gemmini/software/gemmini-rocc-tests +BENCH_COMMON := $(GEMMINI_ROOT)/riscv-tests/benchmarks/common +GEMMINI_INCLUDE := $(GEMMINI_ROOT)/include +MLP_DIR := $(GEMMINI_ROOT)/mlps + +# ============== Compilers ============== +CC := $(RISCV)/bin/riscv64-unknown-elf-gcc + +# ============== Flags ============== +CFLAGS := \ + -DPREALLOCATE=1 \ + -DMULTITHREAD=1 \ + -mcmodel=medany \ + -std=gnu99 \ + -O2 \ + -ffast-math \ + -fno-common \ + -fno-builtin-printf \ + -fno-tree-loop-distribute-patterns \ + -march=rv64gc -Wa,-march=rv64gc \ + -I$(GEMMINI_ROOT)/riscv-tests \ + -I$(GEMMINI_ROOT)/riscv-tests/env \ + -I$(GEMMINI_ROOT) \ + -I$(BENCH_COMMON) \ + -I$(GEMMINI_INCLUDE) \ + -I$(MLP_DIR) \ + -Wno-incompatible-pointer-types + +CFLAGS_BAREMETAL := \ + $(CFLAGS) \ + -nostdlib \ + -nostartfiles \ + -static \ + -T $(BENCH_COMMON)/test.ld \ + -DBAREMETAL=1 + +LIBS := -lm -lgcc + +# Benchmark common sources +BENCH_SRCS := $(wildcard $(BENCH_COMMON)/*.c) $(wildcard $(BENCH_COMMON)/*.S) + +# ============== Buddy MLIR passes ============== +BUDDY_OPT_FLAGS := \ + -lower-gemmini \ + -convert-scf-to-cf \ + -convert-arith-to-llvm \ + -convert-func-to-llvm \ + -llvm-legalize-for-export + +BUDDY_LLC_FLAGS := \ + -O3 \ + -filetype=obj \ + -mtriple=riscv64-unknown-elf \ + -mattr=+buddyext,+d,+f,+c \ + -float-abi=hard \ + -code-model=medium + +# ============== Benchmark definitions ============== +# Each benchmark: (name, mlir-dir, mlir-file, c-file, func-name) +BENCHMARKS := conv conv-with-pool mlp2 mlp2-os mlp1 softmax-matmul
igelu-matmul + +# Build directory for intermediate artifacts +BUILD := build + +# ============== Targets ============== +.PHONY: all clean run-all $(addprefix run-,$(BENCHMARKS)) + +all: $(addsuffix -baremetal,$(BENCHMARKS)) + +# ---- Generic MLIR compilation rules ---- +# Pattern: build/&lt;name&gt;.ll and build/&lt;name&gt;.o from build/&lt;name&gt;.llvm.mlir. +# There is no generic rule for build/&lt;name&gt;.llvm.mlir: the source .mlir files +# live in per-benchmark subdirectories, so each benchmark defines its own +# buddy-opt rule below. +$(BUILD)/%.ll: $(BUILD)/%.llvm.mlir + $(BUDDY)/buddy-translate $< --buddy-to-llvmir -o $@ + +$(BUILD)/%.o: $(BUILD)/%.ll + $(BUDDY)/buddy-llc $(BUDDY_LLC_FLAGS) $< -o $@ + +$(BUILD): + mkdir -p $(BUILD) + +# ---- conv ---- +$(BUILD)/conv-buddy.llvm.mlir: conv/conv-buddy.mlir | $(BUILD) + $(BUDDY)/buddy-opt $< $(BUDDY_OPT_FLAGS) -o $@ + +$(BUILD)/conv-buddy.ll: $(BUILD)/conv-buddy.llvm.mlir + $(BUDDY)/buddy-translate $< --buddy-to-llvmir -o $@ + +$(BUILD)/conv-buddy.o: $(BUILD)/conv-buddy.ll + $(BUDDY)/buddy-llc $(BUDDY_LLC_FLAGS) $< -o $@ + +conv-baremetal: conv/conv-buddy.c $(BUILD)/conv-buddy.o + $(CC) $(CFLAGS_BAREMETAL) $< $(BUILD)/conv-buddy.o $(BENCH_SRCS) $(LIBS) -o $@ + +# ---- conv-with-pool ---- +$(BUILD)/conv-with-pool-buddy.llvm.mlir: conv-with-pool/conv-with-pool-buddy.mlir | $(BUILD) + $(BUDDY)/buddy-opt $< $(BUDDY_OPT_FLAGS) -o $@ + +$(BUILD)/conv-with-pool-buddy.ll: $(BUILD)/conv-with-pool-buddy.llvm.mlir + $(BUDDY)/buddy-translate $< --buddy-to-llvmir -o $@ + +$(BUILD)/conv-with-pool-buddy.o: $(BUILD)/conv-with-pool-buddy.ll + $(BUDDY)/buddy-llc $(BUDDY_LLC_FLAGS) $< -o $@ + +conv-with-pool-baremetal: conv-with-pool/conv-with-pool-buddy.c $(BUILD)/conv-with-pool-buddy.o + $(CC) $(CFLAGS_BAREMETAL) $< $(BUILD)/conv-with-pool-buddy.o $(BENCH_SRCS) $(LIBS) -o $@ + +# ---- mlp2 (weight-stationary) ---- +$(BUILD)/mlp2-buddy.llvm.mlir: mlp2/mlp2-buddy.mlir | $(BUILD) + $(BUDDY)/buddy-opt $< $(BUDDY_OPT_FLAGS) -o $@ + +$(BUILD)/mlp2-buddy.ll: $(BUILD)/mlp2-buddy.llvm.mlir + $(BUDDY)/buddy-translate $< --buddy-to-llvmir -o $@ + +$(BUILD)/mlp2-buddy.o:
$(BUILD)/mlp2-buddy.ll + $(BUDDY)/buddy-llc $(BUDDY_LLC_FLAGS) $< -o $@ + +mlp2-baremetal: mlp2/mlp2-buddy.c $(BUILD)/mlp2-buddy.o + $(CC) $(CFLAGS_BAREMETAL) $< $(BUILD)/mlp2-buddy.o $(BENCH_SRCS) $(LIBS) -o $@ + +# ---- mlp2-os (output-stationary) ---- +$(BUILD)/mlp2-buddy-os.llvm.mlir: mlp2/mlp2-buddy-os.mlir | $(BUILD) + $(BUDDY)/buddy-opt $< $(BUDDY_OPT_FLAGS) -o $@ + +$(BUILD)/mlp2-buddy-os.ll: $(BUILD)/mlp2-buddy-os.llvm.mlir + $(BUDDY)/buddy-translate $< --buddy-to-llvmir -o $@ + +$(BUILD)/mlp2-buddy-os.o: $(BUILD)/mlp2-buddy-os.ll + $(BUDDY)/buddy-llc $(BUDDY_LLC_FLAGS) $< -o $@ + +mlp2-os-baremetal: mlp2/mlp2-buddy.c $(BUILD)/mlp2-buddy-os.o + $(CC) $(CFLAGS_BAREMETAL) $< $(BUILD)/mlp2-buddy-os.o $(BENCH_SRCS) $(LIBS) -o $@ + +# ---- mlp1 (6-layer) ---- +$(BUILD)/mlp1-buddy.llvm.mlir: mlp1/mlp1-buddy.mlir | $(BUILD) + $(BUDDY)/buddy-opt $< $(BUDDY_OPT_FLAGS) -o $@ + +$(BUILD)/mlp1-buddy.ll: $(BUILD)/mlp1-buddy.llvm.mlir + $(BUDDY)/buddy-translate $< --buddy-to-llvmir -o $@ + +$(BUILD)/mlp1-buddy.o: $(BUILD)/mlp1-buddy.ll + $(BUDDY)/buddy-llc $(BUDDY_LLC_FLAGS) $< -o $@ + +mlp1-baremetal: mlp1/mlp1-buddy.c $(BUILD)/mlp1-buddy.o + $(CC) $(CFLAGS_BAREMETAL) $< $(BUILD)/mlp1-buddy.o $(BENCH_SRCS) $(LIBS) -o $@ + +# ---- softmax-matmul ---- +$(BUILD)/softmax-matmul-buddy.llvm.mlir: softmax-matmul/softmax-matmul-buddy.mlir | $(BUILD) + $(BUDDY)/buddy-opt $< $(BUDDY_OPT_FLAGS) -o $@ + +$(BUILD)/softmax-matmul-buddy.ll: $(BUILD)/softmax-matmul-buddy.llvm.mlir + $(BUDDY)/buddy-translate $< --buddy-to-llvmir -o $@ + +$(BUILD)/softmax-matmul-buddy.o: $(BUILD)/softmax-matmul-buddy.ll + $(BUDDY)/buddy-llc $(BUDDY_LLC_FLAGS) $< -o $@ + +softmax-matmul-baremetal: softmax-matmul/softmax-matmul-buddy.c $(BUILD)/softmax-matmul-buddy.o + $(CC) $(CFLAGS_BAREMETAL) $< $(BUILD)/softmax-matmul-buddy.o $(BENCH_SRCS) $(LIBS) -o $@ + +# ---- igelu-matmul ---- +$(BUILD)/igelu-matmul-buddy.llvm.mlir: igelu-matmul/igelu-matmul-buddy.mlir | $(BUILD) + $(BUDDY)/buddy-opt $< 
$(BUDDY_OPT_FLAGS) -o $@ + +$(BUILD)/igelu-matmul-buddy.ll: $(BUILD)/igelu-matmul-buddy.llvm.mlir + $(BUDDY)/buddy-translate $< --buddy-to-llvmir -o $@ + +$(BUILD)/igelu-matmul-buddy.o: $(BUILD)/igelu-matmul-buddy.ll + $(BUDDY)/buddy-llc $(BUDDY_LLC_FLAGS) $< -o $@ + +igelu-matmul-baremetal: igelu-matmul/igelu-matmul-buddy.c $(BUILD)/igelu-matmul-buddy.o + $(CC) $(CFLAGS_BAREMETAL) $< $(BUILD)/igelu-matmul-buddy.o $(BENCH_SRCS) $(LIBS) -o $@ + +# ============== Run targets ============== +run-conv: conv-baremetal + $(SPIKE) --extension=gemmini $< + +run-conv-with-pool: conv-with-pool-baremetal + $(SPIKE) --extension=gemmini $< + +run-mlp2: mlp2-baremetal + $(SPIKE) --extension=gemmini $< + +run-mlp2-os: mlp2-os-baremetal + $(SPIKE) --extension=gemmini $< + +run-mlp1: mlp1-baremetal + $(SPIKE) --extension=gemmini $< + +run-softmax-matmul: softmax-matmul-baremetal + $(SPIKE) --extension=gemmini $< + +run-igelu-matmul: igelu-matmul-baremetal + $(SPIKE) --extension=gemmini $< + +run-all: $(addsuffix -baremetal,$(BENCHMARKS)) + @for bench in $(BENCHMARKS); do \ + echo "=== Running $$bench ==="; \ + $(SPIKE) --extension=gemmini $${bench}-baremetal 2>&1; \ + echo ""; \ + done + +# ============== Clean ============== +clean: + rm -rf $(BUILD) + rm -f $(addsuffix -baremetal,$(BENCHMARKS)) diff --git a/experiments/buddy-benchmarks/kernels/conv-with-pool/conv-with-pool-buddy.c b/experiments/buddy-benchmarks/kernels/conv-with-pool/conv-with-pool-buddy.c new file mode 100644 index 0000000..12e3ecc --- /dev/null +++ b/experiments/buddy-benchmarks/kernels/conv-with-pool/conv-with-pool-buddy.c @@ -0,0 +1,186 @@ +#include <stdio.h> +#include <stdint.h> +#include <stdlib.h> +#include <assert.h> + +#include "include/gemmini.h" +#include "include/gemmini_testutils.h" + +#define IN_ROW_DIM 17 +#define IN_COL_DIM 17 +#define IN_CHANNELS 18 +#define OUT_CHANNELS 19 +#define BATCH_SIZE 2 +#define KERNEL_DIM 3 +#define PADDING 1 +#define STRIDE 2 + +#define POOL_SIZE 3 +#define POOL_STRIDE 2 +#define POOL_PADDING 1 + +#define
OUT_ROW_DIM ((IN_ROW_DIM + 2 * PADDING - KERNEL_DIM) / STRIDE + 1) +#define OUT_COL_DIM ((IN_COL_DIM + 2 * PADDING - KERNEL_DIM) / STRIDE + 1) +#define PATCH_SIZE (KERNEL_DIM * KERNEL_DIM * IN_CHANNELS) +#define N_PATCHES (BATCH_SIZE * OUT_ROW_DIM * OUT_COL_DIM) + +#define POOL_OUT_ROW_DIM ((OUT_ROW_DIM + 2 * POOL_PADDING - POOL_SIZE) / POOL_STRIDE + 1) +#define POOL_OUT_COL_DIM ((OUT_COL_DIM + 2 * POOL_PADDING - POOL_SIZE) / POOL_STRIDE + 1) + +typedef struct { + elem_t *basePtr; + elem_t *data; + int64_t offset; + int64_t sizes[4]; + int64_t strides[4]; +} MemRef4D_i8; + +typedef struct { + elem_t *basePtr; + elem_t *data; + int64_t offset; + int64_t sizes[2]; + int64_t strides[2]; +} MemRef2D_i8; + +typedef struct { + acc_t *basePtr; + acc_t *data; + int64_t offset; + int64_t sizes[1]; + int64_t strides[1]; +} MemRef1D_i32; + +extern void _mlir_ciface_conv_with_pool(MemRef4D_i8 *input, MemRef2D_i8 *weights, + MemRef1D_i32 *bias, MemRef2D_i8 *output); + +static MemRef4D_i8 make_memref4_i8(elem_t *base, int64_t d0, int64_t d1, + int64_t d2, int64_t d3) { + MemRef4D_i8 ref; + ref.basePtr = base; + ref.data = base; + ref.offset = 0; + ref.sizes[0] = d0; + ref.sizes[1] = d1; + ref.sizes[2] = d2; + ref.sizes[3] = d3; + ref.strides[3] = 1; + ref.strides[2] = d3; + ref.strides[1] = d2 * d3; + ref.strides[0] = d1 * d2 * d3; + return ref; +} + +static MemRef2D_i8 make_memref2_i8(elem_t *base, int64_t rows, int64_t cols) { + MemRef2D_i8 ref; + ref.basePtr = base; + ref.data = base; + ref.offset = 0; + ref.sizes[0] = rows; + ref.sizes[1] = cols; + ref.strides[1] = 1; + ref.strides[0] = cols; + return ref; +} + +static MemRef1D_i32 make_memref1_i32(acc_t *base, int64_t len) { + MemRef1D_i32 ref; + ref.basePtr = base; + ref.data = base; + ref.offset = 0; + ref.sizes[0] = len; + ref.strides[0] = 1; + return ref; +} + +static void init_random(elem_t *buf, int len) { + for (elem_t *ptr = buf; ptr < buf + len; ptr++) { + *ptr = (rand() % 5) - 2; + } +} + +static void 
init_random_acc(acc_t *buf, int len) { + for (acc_t *ptr = buf; ptr < buf + len; ptr++) { + *ptr = (rand() % 5) - 2; + } +} + +static void flatten_weights(int out_channels, int kernel_dim, int in_channels, + int patch_size, + elem_t weights[out_channels][kernel_dim][kernel_dim][in_channels], + elem_t weights_mat[patch_size][out_channels]) { + assert(patch_size == kernel_dim * kernel_dim * in_channels); + for (int outc = 0; outc < out_channels; outc++) { + for (int krow = 0; krow < kernel_dim; krow++) { + for (int kcol = 0; kcol < kernel_dim; kcol++) { + for (int inc = 0; inc < in_channels; inc++) { + int wmatrow = krow * kernel_dim * in_channels + + kcol * in_channels + inc; + weights_mat[wmatrow][outc] = weights[outc][krow][kcol][inc]; + } + } + } + } +} + +int main(void) { + static elem_t input[BATCH_SIZE][IN_ROW_DIM][IN_COL_DIM][IN_CHANNELS]; + static elem_t weights[OUT_CHANNELS][KERNEL_DIM][KERNEL_DIM][IN_CHANNELS]; + static acc_t bias[OUT_CHANNELS]; + static elem_t weights_mat[PATCH_SIZE][OUT_CHANNELS]; + static elem_t pool_output_mat[BATCH_SIZE * POOL_OUT_ROW_DIM * POOL_OUT_COL_DIM][OUT_CHANNELS]; + + init_random(&input[0][0][0][0], sizeof(input) / sizeof(elem_t)); + init_random(&weights[0][0][0][0], sizeof(weights) / sizeof(elem_t)); + init_random_acc(&bias[0], sizeof(bias) / sizeof(acc_t)); + flatten_weights(OUT_CHANNELS, KERNEL_DIM, IN_CHANNELS, PATCH_SIZE, + weights, weights_mat); + + long long input_checksum = 0; + elem_t *input_ptr = &input[0][0][0][0]; + int input_elems = BATCH_SIZE * IN_ROW_DIM * IN_COL_DIM * IN_CHANNELS; + for (int i = 0; i < input_elems; ++i) { + input_checksum += input_ptr[i]; + } + long long weight_checksum = 0; + elem_t *weight_ptr = &weights[0][0][0][0]; + int weight_elems = OUT_CHANNELS * KERNEL_DIM * KERNEL_DIM * IN_CHANNELS; + for (int i = 0; i < weight_elems; ++i) { + weight_checksum += weight_ptr[i]; + } + long long bias_checksum = 0; + for (int i = 0; i < OUT_CHANNELS; ++i) { + bias_checksum += bias[i]; + } + printf("Input 
checksum: %lld\n", input_checksum); + printf("Weights checksum: %lld\n", weight_checksum); + printf("Bias checksum: %lld\n", bias_checksum); + + MemRef4D_i8 input_ref = + make_memref4_i8(&input[0][0][0][0], BATCH_SIZE, IN_ROW_DIM, IN_COL_DIM, + IN_CHANNELS); + MemRef2D_i8 weights_ref = + make_memref2_i8(&weights_mat[0][0], PATCH_SIZE, OUT_CHANNELS); + MemRef1D_i32 bias_ref = make_memref1_i32(&bias[0], OUT_CHANNELS); + MemRef2D_i8 output_ref = + make_memref2_i8(&pool_output_mat[0][0], + BATCH_SIZE * POOL_OUT_ROW_DIM * POOL_OUT_COL_DIM, + OUT_CHANNELS); + + gemmini_flush(0); + uint64_t start = read_cycles(); + _mlir_ciface_conv_with_pool(&input_ref, &weights_ref, &bias_ref, &output_ref); + gemmini_fence(); + uint64_t end = read_cycles(); + + printf("Buddy conv_with_pool cycles: %llu\n", + (unsigned long long)(end - start)); + long long checksum = 0; + for (int i = 0; i < BATCH_SIZE * POOL_OUT_ROW_DIM * POOL_OUT_COL_DIM; ++i) { + for (int j = 0; j < OUT_CHANNELS; ++j) { + checksum += pool_output_mat[i][j]; + } + } + printf("Buddy conv_with_pool output checksum: %lld\n", checksum); + return 0; +} diff --git a/experiments/buddy-benchmarks/kernels/conv-with-pool/conv-with-pool-buddy.mlir b/experiments/buddy-benchmarks/kernels/conv-with-pool/conv-with-pool-buddy.mlir new file mode 100644 index 0000000..15383e4 --- /dev/null +++ b/experiments/buddy-benchmarks/kernels/conv-with-pool/conv-with-pool-buddy.mlir @@ -0,0 +1,15 @@ +module { + func.func @conv_with_pool(%input: memref<2x17x17x18xi8>, + %weights: memref<162x19xi8>, + %bias: memref<19xi32>, + %output: memref<50x19xi8>) attributes { llvm.emit_c_interface } { + %c9 = arith.constant 9 : i64 + %c3 = arith.constant 3 : i64 + gemmini.tile_conv %input %weights %bias %output %c9 %c9 %c3 + {stride = 2, inputDilation = 1, kernelDilation = 1, padding = 1, + act = 0, poolSize = 3, poolStride = 2, poolPadding = 1} : + memref<2x17x17x18xi8> memref<162x19xi8> memref<19xi32> memref<50x19xi8> + i64 i64 i64 + return + } +} diff --git 
a/experiments/buddy-benchmarks/kernels/conv/conv-buddy.c b/experiments/buddy-benchmarks/kernels/conv/conv-buddy.c new file mode 100644 index 0000000..4646503 --- /dev/null +++ b/experiments/buddy-benchmarks/kernels/conv/conv-buddy.c @@ -0,0 +1,177 @@ +#include <stdio.h> +#include <stdint.h> +#include <stdlib.h> +#include <assert.h> + +#include "include/gemmini.h" +#include "include/gemmini_testutils.h" + +#define IN_ROW_DIM 17 +#define IN_COL_DIM 17 +#define IN_CHANNELS 18 +#define OUT_CHANNELS 19 +#define BATCH_SIZE 2 +#define KERNEL_DIM 3 +#define PADDING 1 +#define STRIDE 2 + +#define OUT_ROW_DIM ((IN_ROW_DIM + 2 * PADDING - KERNEL_DIM) / STRIDE + 1) +#define OUT_COL_DIM ((IN_COL_DIM + 2 * PADDING - KERNEL_DIM) / STRIDE + 1) +#define PATCH_SIZE (KERNEL_DIM * KERNEL_DIM * IN_CHANNELS) +#define N_PATCHES (BATCH_SIZE * OUT_ROW_DIM * OUT_COL_DIM) + +typedef struct { + elem_t *basePtr; + elem_t *data; + int64_t offset; + int64_t sizes[4]; + int64_t strides[4]; +} MemRef4D_i8; + +typedef struct { + elem_t *basePtr; + elem_t *data; + int64_t offset; + int64_t sizes[2]; + int64_t strides[2]; +} MemRef2D_i8; + +typedef struct { + acc_t *basePtr; + acc_t *data; + int64_t offset; + int64_t sizes[1]; + int64_t strides[1]; +} MemRef1D_i32; + +extern void _mlir_ciface_conv(MemRef4D_i8 *input, MemRef2D_i8 *weights, + MemRef1D_i32 *bias, MemRef2D_i8 *output); + +static MemRef4D_i8 make_memref4_i8(elem_t *base, int64_t d0, int64_t d1, + int64_t d2, int64_t d3) { + MemRef4D_i8 ref; + ref.basePtr = base; + ref.data = base; + ref.offset = 0; + ref.sizes[0] = d0; + ref.sizes[1] = d1; + ref.sizes[2] = d2; + ref.sizes[3] = d3; + ref.strides[3] = 1; + ref.strides[2] = d3; + ref.strides[1] = d2 * d3; + ref.strides[0] = d1 * d2 * d3; + return ref; +} + +static MemRef2D_i8 make_memref2_i8(elem_t *base, int64_t rows, int64_t cols) { + MemRef2D_i8 ref; + ref.basePtr = base; + ref.data = base; + ref.offset = 0; + ref.sizes[0] = rows; + ref.sizes[1] = cols; + ref.strides[1] = 1; + ref.strides[0] = cols; + return ref; +} + +static
MemRef1D_i32 make_memref1_i32(acc_t *base, int64_t len) { + MemRef1D_i32 ref; + ref.basePtr = base; + ref.data = base; + ref.offset = 0; + ref.sizes[0] = len; + ref.strides[0] = 1; + return ref; +} + +static void init_random(elem_t *buf, int len) { + for (elem_t *ptr = buf; ptr < buf + len; ptr++) { + *ptr = (rand() % 5) - 2; + } +} + +static void init_random_acc(acc_t *buf, int len) { + for (acc_t *ptr = buf; ptr < buf + len; ptr++) { + *ptr = (rand() % 5) - 2; + } +} + +static void flatten_weights(int out_channels, int kernel_dim, int in_channels, + int patch_size, + elem_t weights[out_channels][kernel_dim][kernel_dim][in_channels], + elem_t weights_mat[patch_size][out_channels]) { + assert(patch_size == kernel_dim * kernel_dim * in_channels); + for (int outc = 0; outc < out_channels; outc++) { + for (int krow = 0; krow < kernel_dim; krow++) { + for (int kcol = 0; kcol < kernel_dim; kcol++) { + for (int inc = 0; inc < in_channels; inc++) { + int wmatrow = krow * kernel_dim * in_channels + + kcol * in_channels + inc; + weights_mat[wmatrow][outc] = weights[outc][krow][kcol][inc]; + } + } + } + } +} + +int main(void) { + static elem_t input[BATCH_SIZE][IN_ROW_DIM][IN_COL_DIM][IN_CHANNELS]; + static elem_t weights[OUT_CHANNELS][KERNEL_DIM][KERNEL_DIM][IN_CHANNELS]; + static acc_t bias[OUT_CHANNELS]; + static elem_t weights_mat[PATCH_SIZE][OUT_CHANNELS]; + static elem_t output_mat[N_PATCHES][OUT_CHANNELS]; + + init_random(&input[0][0][0][0], sizeof(input) / sizeof(elem_t)); + init_random(&weights[0][0][0][0], sizeof(weights) / sizeof(elem_t)); + init_random_acc(&bias[0], sizeof(bias) / sizeof(acc_t)); + flatten_weights(OUT_CHANNELS, KERNEL_DIM, IN_CHANNELS, PATCH_SIZE, + weights, weights_mat); + + long long input_checksum = 0; + elem_t *input_ptr = &input[0][0][0][0]; + int input_elems = BATCH_SIZE * IN_ROW_DIM * IN_COL_DIM * IN_CHANNELS; + for (int i = 0; i < input_elems; ++i) { + input_checksum += input_ptr[i]; + } + long long weight_checksum = 0; + elem_t 
*weight_ptr = &weights[0][0][0][0]; + int weight_elems = OUT_CHANNELS * KERNEL_DIM * KERNEL_DIM * IN_CHANNELS; + for (int i = 0; i < weight_elems; ++i) { + weight_checksum += weight_ptr[i]; + } + long long bias_checksum = 0; + for (int i = 0; i < OUT_CHANNELS; ++i) { + bias_checksum += bias[i]; + } + printf("Input checksum: %lld\n", input_checksum); + printf("Weights checksum: %lld\n", weight_checksum); + printf("Bias checksum: %lld\n", bias_checksum); + + MemRef4D_i8 input_ref = + make_memref4_i8(&input[0][0][0][0], BATCH_SIZE, IN_ROW_DIM, IN_COL_DIM, + IN_CHANNELS); + MemRef2D_i8 weights_ref = + make_memref2_i8(&weights_mat[0][0], PATCH_SIZE, OUT_CHANNELS); + MemRef1D_i32 bias_ref = make_memref1_i32(&bias[0], OUT_CHANNELS); + MemRef2D_i8 output_ref = + make_memref2_i8(&output_mat[0][0], N_PATCHES, OUT_CHANNELS); + + gemmini_flush(0); + uint64_t start = read_cycles(); + _mlir_ciface_conv(&input_ref, &weights_ref, &bias_ref, &output_ref); + gemmini_fence(); + uint64_t end = read_cycles(); + + printf("Buddy conv cycles: %llu\n", + (unsigned long long)(end - start)); + long long checksum = 0; + for (int i = 0; i < N_PATCHES; ++i) { + for (int j = 0; j < OUT_CHANNELS; ++j) { + checksum += output_mat[i][j]; + } + } + printf("Buddy conv output checksum: %lld\n", checksum); + return 0; +} diff --git a/experiments/buddy-benchmarks/kernels/conv/conv-buddy.mlir b/experiments/buddy-benchmarks/kernels/conv/conv-buddy.mlir new file mode 100644 index 0000000..ba91a8d --- /dev/null +++ b/experiments/buddy-benchmarks/kernels/conv/conv-buddy.mlir @@ -0,0 +1,15 @@ +module { + func.func @conv(%input: memref<2x17x17x18xi8>, + %weights: memref<162x19xi8>, + %bias: memref<19xi32>, + %output: memref<162x19xi8>) attributes { llvm.emit_c_interface } { + %c9 = arith.constant 9 : i64 + %c3 = arith.constant 3 : i64 + gemmini.tile_conv %input %weights %bias %output %c9 %c9 %c3 + {stride = 2, inputDilation = 1, kernelDilation = 1, padding = 1, + act = 0} : + memref<2x17x17x18xi8> 
memref<162x19xi8> memref<19xi32> memref<162x19xi8>
+    i64 i64 i64
+    return
+  }
+}
diff --git a/experiments/buddy-benchmarks/kernels/igelu-matmul/igelu-matmul-buddy.c b/experiments/buddy-benchmarks/kernels/igelu-matmul/igelu-matmul-buddy.c
new file mode 100644
index 0000000..07832f0
--- /dev/null
+++ b/experiments/buddy-benchmarks/kernels/igelu-matmul/igelu-matmul-buddy.c
@@ -0,0 +1,121 @@
+#include <stdio.h>
+#include <stdlib.h>
+
+#include "include/gemmini.h"
+#include "include/gemmini_testutils.h"
+
+#define MAT_DIM_I 30
+#define MAT_DIM_K 30
+#define MAT_DIM_J 30
+
+typedef struct {
+  elem_t *basePtr;
+  elem_t *data;
+  int64_t offset;
+  int64_t sizes[2];
+  int64_t strides[2];
+} MemRef2D_i8;
+
+typedef struct {
+  acc_t *basePtr;
+  acc_t *data;
+  int64_t offset;
+  int64_t sizes[2];
+  int64_t strides[2];
+} MemRef2D_i32;
+
+extern void _mlir_ciface_igelu_matmul(MemRef2D_i8 *a, MemRef2D_i8 *b,
+                                      MemRef2D_i8 *c, MemRef2D_i32 *d);
+
+static MemRef2D_i8 make_memref_i8(elem_t *base, int64_t rows, int64_t cols) {
+  MemRef2D_i8 ref;
+  ref.basePtr = base;
+  ref.data = base;
+  ref.offset = 0;
+  ref.sizes[0] = rows;
+  ref.sizes[1] = cols;
+  ref.strides[1] = 1;
+  ref.strides[0] = cols;
+  return ref;
+}
+
+static MemRef2D_i32 make_memref_i32(acc_t *base, int64_t rows, int64_t cols) {
+  MemRef2D_i32 ref;
+  ref.basePtr = base;
+  ref.data = base;
+  ref.offset = 0;
+  ref.sizes[0] = rows;
+  ref.sizes[1] = cols;
+  ref.strides[1] = 1;
+  ref.strides[0] = cols;
+  return ref;
+}
+
+int main(void) {
+  static elem_t full_A[MAT_DIM_I][MAT_DIM_K] row_align(1);
+  static elem_t full_B[MAT_DIM_K][MAT_DIM_J] row_align(1);
+  static elem_t full_C[MAT_DIM_I][MAT_DIM_J] row_align(1);
+  static acc_t full_D[MAT_DIM_I][MAT_DIM_J] row_align_acc(1);
+
+  for (size_t i = 0; i < MAT_DIM_I; ++i) {
+    for (size_t j = 0; j < MAT_DIM_K; ++j) {
+      full_A[i][j] = (rand() % 3) - 1;
+    }
+  }
+
+  for (size_t i = 0; i < MAT_DIM_K; ++i) {
+    for (size_t j = 0; j < MAT_DIM_J; ++j) {
+      full_B[i][j] = (rand() % 3) - 1;
+    }
+  }
+
+  for (size_t i 
= 0; i < MAT_DIM_I; ++i) { + for (size_t j = 0; j < MAT_DIM_J; ++j) { + full_D[i][j] = 0; + } + } + + long long a_checksum = 0; + elem_t *a_ptr = &full_A[0][0]; + int a_elems = MAT_DIM_I * MAT_DIM_K; + for (int i = 0; i < a_elems; ++i) { + a_checksum += a_ptr[i]; + } + long long b_checksum = 0; + elem_t *b_ptr = &full_B[0][0]; + int b_elems = MAT_DIM_K * MAT_DIM_J; + for (int i = 0; i < b_elems; ++i) { + b_checksum += b_ptr[i]; + } + long long d_checksum = 0; + acc_t *d_ptr = &full_D[0][0]; + int d_elems = MAT_DIM_I * MAT_DIM_J; + for (int i = 0; i < d_elems; ++i) { + d_checksum += d_ptr[i]; + } + printf("A checksum: %lld\n", a_checksum); + printf("B checksum: %lld\n", b_checksum); + printf("D checksum: %lld\n", d_checksum); + + MemRef2D_i8 a_ref = make_memref_i8(&full_A[0][0], MAT_DIM_I, MAT_DIM_K); + MemRef2D_i8 b_ref = make_memref_i8(&full_B[0][0], MAT_DIM_K, MAT_DIM_J); + MemRef2D_i8 c_ref = make_memref_i8(&full_C[0][0], MAT_DIM_I, MAT_DIM_J); + MemRef2D_i32 d_ref = make_memref_i32(&full_D[0][0], MAT_DIM_I, MAT_DIM_J); + + gemmini_flush(0); + uint64_t start = read_cycles(); + _mlir_ciface_igelu_matmul(&a_ref, &b_ref, &c_ref, &d_ref); + gemmini_fence(); + uint64_t end = read_cycles(); + + printf("Buddy igelu matmul cycles: %llu\n", + (unsigned long long)(end - start)); + long long c_checksum = 0; + elem_t *c_ptr = &full_C[0][0]; + int c_elems = MAT_DIM_I * MAT_DIM_J; + for (int i = 0; i < c_elems; ++i) { + c_checksum += c_ptr[i]; + } + printf("Buddy output checksum: %lld\n", c_checksum); + return 0; +} diff --git a/experiments/buddy-benchmarks/kernels/igelu-matmul/igelu-matmul-buddy.mlir b/experiments/buddy-benchmarks/kernels/igelu-matmul/igelu-matmul-buddy.mlir new file mode 100644 index 0000000..74dea9f --- /dev/null +++ b/experiments/buddy-benchmarks/kernels/igelu-matmul/igelu-matmul-buddy.mlir @@ -0,0 +1,10 @@ +module { + func.func @igelu_matmul(%a: memref<30x30xi8>, + %b: memref<30x30xi8>, + %c: memref<30x30xi8>, + %d: memref<30x30xi32>) attributes { 
llvm.emit_c_interface } {
+    gemmini.tile_matmul %a %b %c %d {act = 3, bertScale = 0.8:f32, dataflow = 1} :
+      memref<30x30xi8> memref<30x30xi8> memref<30x30xi8> memref<30x30xi32>
+    return
+  }
+}
diff --git a/experiments/buddy-benchmarks/kernels/mlp1/mlp1-buddy.c b/experiments/buddy-benchmarks/kernels/mlp1/mlp1-buddy.c
new file mode 100644
index 0000000..4a6630d
--- /dev/null
+++ b/experiments/buddy-benchmarks/kernels/mlp1/mlp1-buddy.c
@@ -0,0 +1,153 @@
+#include <stdint.h>
+#include <stdio.h>
+#include <string.h>
+
+#include "include/gemmini.h"
+#include "parameters1.h"
+
+typedef struct {
+  elem_t *basePtr;
+  elem_t *data;
+  int64_t offset;
+  int64_t sizes[2];
+  int64_t strides[2];
+} MemRef2D_i8;
+
+typedef struct {
+  acc_t *basePtr;
+  acc_t *data;
+  int64_t offset;
+  int64_t sizes[2];
+  int64_t strides[2];
+} MemRef2D_i32;
+
+extern void _mlir_ciface_mlp1(MemRef2D_i8 *a0, MemRef2D_i8 *w0,
+                              MemRef2D_i8 *c0, MemRef2D_i32 *d0,
+                              MemRef2D_i8 *w1, MemRef2D_i8 *c1,
+                              MemRef2D_i32 *d1, MemRef2D_i8 *w2,
+                              MemRef2D_i8 *c2, MemRef2D_i32 *d2,
+                              MemRef2D_i8 *w3, MemRef2D_i8 *c3,
+                              MemRef2D_i32 *d3, MemRef2D_i8 *w4,
+                              MemRef2D_i8 *c4, MemRef2D_i32 *d4,
+                              MemRef2D_i8 *w5, MemRef2D_i8 *c5,
+                              MemRef2D_i32 *d5);
+
+static uint32_t lcg_state = 777;
+static inline elem_t next_elem(void) {
+  lcg_state = lcg_state * 1664525u + 1013904223u;
+  return (elem_t)((lcg_state >> 24) % 5) - 2;
+}
+
+static void init_random_i8(elem_t *buf, int len) {
+  for (int i = 0; i < len; ++i) {
+    buf[i] = next_elem();
+  }
+}
+
+static inline uint64_t read_cycles(void) {
+  uint64_t cycles;
+  asm volatile("rdcycle %0" : "=r"(cycles));
+  return cycles;
+}
+
+static MemRef2D_i8 make_memref_i8(elem_t *base, int64_t rows, int64_t cols) {
+  MemRef2D_i8 ref;
+  ref.basePtr = base;
+  ref.data = base;
+  ref.offset = 0;
+  ref.sizes[0] = rows;
+  ref.sizes[1] = cols;
+  ref.strides[1] = 1;
+  ref.strides[0] = cols;
+  return ref;
+}
+
+static MemRef2D_i32 make_memref_i32(acc_t *base, int64_t rows, int64_t cols) {
+  MemRef2D_i32 ref;
+  ref.basePtr = 
base; + ref.data = base; + ref.offset = 0; + ref.sizes[0] = rows; + ref.sizes[1] = cols; + ref.strides[1] = 1; + ref.strides[0] = cols; + return ref; +} + +static acc_t d0_bias[64][2560] row_align_acc(1) = {0}; +static acc_t d1_bias[64][2048] row_align_acc(1) = {0}; +static acc_t d2_bias[64][1536] row_align_acc(1) = {0}; +static acc_t d3_bias[64][1024] row_align_acc(1) = {0}; +static acc_t d4_bias[64][512] row_align_acc(1) = {0}; +static acc_t d5_bias[64][64] row_align_acc(1) = {0}; + +int main(void) { + lcg_state = 777; + init_random_i8(&input_mat[0][0], (int)(sizeof(input_mat) / sizeof(elem_t))); + init_random_i8(&weights0[0][0], (int)(sizeof(weights0) / sizeof(elem_t))); + init_random_i8(&weights1[0][0], (int)(sizeof(weights1) / sizeof(elem_t))); + init_random_i8(&weights2[0][0], (int)(sizeof(weights2) / sizeof(elem_t))); + init_random_i8(&weights3[0][0], (int)(sizeof(weights3) / sizeof(elem_t))); + init_random_i8(&weights4[0][0], (int)(sizeof(weights4) / sizeof(elem_t))); + init_random_i8(&weights5[0][0], (int)(sizeof(weights5) / sizeof(elem_t))); + + memset(inter_results0, 0, sizeof(inter_results0)); + memset(inter_results1, 0, sizeof(inter_results1)); + memset(inter_results2, 0, sizeof(inter_results2)); + memset(inter_results3, 0, sizeof(inter_results3)); + memset(inter_results4, 0, sizeof(inter_results4)); + memset(inter_results5, 0, sizeof(inter_results5)); + memset(d0_bias, 0, sizeof(d0_bias)); + memset(d1_bias, 0, sizeof(d1_bias)); + memset(d2_bias, 0, sizeof(d2_bias)); + memset(d3_bias, 0, sizeof(d3_bias)); + memset(d4_bias, 0, sizeof(d4_bias)); + memset(d5_bias, 0, sizeof(d5_bias)); + + MemRef2D_i8 a0_ref = make_memref_i8(&input_mat[0][0], 64, 832); + MemRef2D_i8 w0_ref = make_memref_i8(&weights0[0][0], 832, 2560); + MemRef2D_i8 c0_ref = make_memref_i8(&inter_results0[0][0], 64, 2560); + MemRef2D_i32 d0_ref = make_memref_i32(&d0_bias[0][0], 64, 2560); + + MemRef2D_i8 w1_ref = make_memref_i8(&weights1[0][0], 2560, 2048); + MemRef2D_i8 c1_ref = 
make_memref_i8(&inter_results1[0][0], 64, 2048); + MemRef2D_i32 d1_ref = make_memref_i32(&d1_bias[0][0], 64, 2048); + + MemRef2D_i8 w2_ref = make_memref_i8(&weights2[0][0], 2048, 1536); + MemRef2D_i8 c2_ref = make_memref_i8(&inter_results2[0][0], 64, 1536); + MemRef2D_i32 d2_ref = make_memref_i32(&d2_bias[0][0], 64, 1536); + + MemRef2D_i8 w3_ref = make_memref_i8(&weights3[0][0], 1536, 1024); + MemRef2D_i8 c3_ref = make_memref_i8(&inter_results3[0][0], 64, 1024); + MemRef2D_i32 d3_ref = make_memref_i32(&d3_bias[0][0], 64, 1024); + + MemRef2D_i8 w4_ref = make_memref_i8(&weights4[0][0], 1024, 512); + MemRef2D_i8 c4_ref = make_memref_i8(&inter_results4[0][0], 64, 512); + MemRef2D_i32 d4_ref = make_memref_i32(&d4_bias[0][0], 64, 512); + + MemRef2D_i8 w5_ref = make_memref_i8(&weights5[0][0], 512, 64); + MemRef2D_i8 c5_ref = make_memref_i8(&inter_results5[0][0], 64, 64); + MemRef2D_i32 d5_ref = make_memref_i32(&d5_bias[0][0], 64, 64); + + gemmini_flush(0); + + uint64_t start = read_cycles(); + _mlir_ciface_mlp1(&a0_ref, &w0_ref, &c0_ref, &d0_ref, + &w1_ref, &c1_ref, &d1_ref, + &w2_ref, &c2_ref, &d2_ref, + &w3_ref, &c3_ref, &d3_ref, + &w4_ref, &c4_ref, &d4_ref, + &w5_ref, &c5_ref, &d5_ref); + gemmini_fence(); + uint64_t end = read_cycles(); + + printf("Buddy mlp1 cycles: %llu\n", (unsigned long long)(end - start)); + long long checksum = 0; + for (int i = 0; i < 64; ++i) { + for (int j = 0; j < 64; ++j) { + checksum += inter_results5[i][j]; + } + } + printf("Buddy mlp1 output checksum: %lld\n", checksum); + return 0; +} diff --git a/experiments/buddy-benchmarks/kernels/mlp1/mlp1-buddy.mlir b/experiments/buddy-benchmarks/kernels/mlp1/mlp1-buddy.mlir new file mode 100644 index 0000000..a9d9fa2 --- /dev/null +++ b/experiments/buddy-benchmarks/kernels/mlp1/mlp1-buddy.mlir @@ -0,0 +1,35 @@ +module { + func.func @mlp1(%a0: memref<64x832xi8>, + %w0: memref<832x2560xi8>, + %c0: memref<64x2560xi8>, + %d0: memref<64x2560xi32>, + %w1: memref<2560x2048xi8>, + %c1: memref<64x2048xi8>, 
+ %d1: memref<64x2048xi32>, + %w2: memref<2048x1536xi8>, + %c2: memref<64x1536xi8>, + %d2: memref<64x1536xi32>, + %w3: memref<1536x1024xi8>, + %c3: memref<64x1024xi8>, + %d3: memref<64x1024xi32>, + %w4: memref<1024x512xi8>, + %c4: memref<64x512xi8>, + %d4: memref<64x512xi32>, + %w5: memref<512x64xi8>, + %c5: memref<64x64xi8>, + %d5: memref<64x64xi32>) attributes { llvm.emit_c_interface } { + gemmini.tile_matmul %a0 %w0 %c0 %d0 {dataflow = 1, act = 1} : + memref<64x832xi8> memref<832x2560xi8> memref<64x2560xi8> memref<64x2560xi32> + gemmini.tile_matmul %c0 %w1 %c1 %d1 {dataflow = 1, act = 1} : + memref<64x2560xi8> memref<2560x2048xi8> memref<64x2048xi8> memref<64x2048xi32> + gemmini.tile_matmul %c1 %w2 %c2 %d2 {dataflow = 1, act = 1} : + memref<64x2048xi8> memref<2048x1536xi8> memref<64x1536xi8> memref<64x1536xi32> + gemmini.tile_matmul %c2 %w3 %c3 %d3 {dataflow = 1, act = 1} : + memref<64x1536xi8> memref<1536x1024xi8> memref<64x1024xi8> memref<64x1024xi32> + gemmini.tile_matmul %c3 %w4 %c4 %d4 {dataflow = 1, act = 1} : + memref<64x1024xi8> memref<1024x512xi8> memref<64x512xi8> memref<64x512xi32> + gemmini.tile_matmul %c4 %w5 %c5 %d5 {dataflow = 1, act = 1} : + memref<64x512xi8> memref<512x64xi8> memref<64x64xi8> memref<64x64xi32> + return + } +} diff --git a/experiments/buddy-benchmarks/kernels/mlp2/mlp2-buddy-os.mlir b/experiments/buddy-benchmarks/kernels/mlp2/mlp2-buddy-os.mlir new file mode 100644 index 0000000..34efe2a --- /dev/null +++ b/experiments/buddy-benchmarks/kernels/mlp2/mlp2-buddy-os.mlir @@ -0,0 +1,15 @@ +module { + func.func @mlp2(%a0: memref<64x832xi8>, + %w0: memref<832x832xi8>, + %c0: memref<64x832xi8>, + %d0: memref<64x832xi32>, + %w1: memref<832x64xi8>, + %c1: memref<64x64xi8>, + %d1: memref<64x64xi32>) attributes { llvm.emit_c_interface } { + gemmini.tile_matmul %a0 %w0 %c0 %d0 {dataflow = 0, act = 1} : + memref<64x832xi8> memref<832x832xi8> memref<64x832xi8> memref<64x832xi32> + gemmini.tile_matmul %c0 %w1 %c1 %d1 {dataflow = 0, act = 1} : + 
memref<64x832xi8> memref<832x64xi8> memref<64x64xi8> memref<64x64xi32>
+    return
+  }
+}
diff --git a/experiments/buddy-benchmarks/kernels/mlp2/mlp2-buddy.c b/experiments/buddy-benchmarks/kernels/mlp2/mlp2-buddy.c
new file mode 100644
index 0000000..a1e393d
--- /dev/null
+++ b/experiments/buddy-benchmarks/kernels/mlp2/mlp2-buddy.c
@@ -0,0 +1,117 @@
+#include <stdint.h>
+#include <stdio.h>
+
+#include "include/gemmini.h"
+#include "parameters2.h"
+
+typedef struct {
+  elem_t *basePtr;
+  elem_t *data;
+  int64_t offset;
+  int64_t sizes[2];
+  int64_t strides[2];
+} MemRef2D_i8;
+
+typedef struct {
+  acc_t *basePtr;
+  acc_t *data;
+  int64_t offset;
+  int64_t sizes[2];
+  int64_t strides[2];
+} MemRef2D_i32;
+
+extern void _mlir_ciface_mlp2(MemRef2D_i8 *a0, MemRef2D_i8 *w0,
+                              MemRef2D_i8 *c0, MemRef2D_i32 *d0,
+                              MemRef2D_i8 *w1, MemRef2D_i8 *c1,
+                              MemRef2D_i32 *d1);
+
+static uint32_t lcg_state = 777;
+static inline elem_t next_elem(void) {
+  lcg_state = lcg_state * 1664525u + 1013904223u;
+  return (elem_t)((lcg_state >> 24) % 5) - 2;
+}
+
+static void init_random_i8(elem_t *buf, int len) {
+  for (int i = 0; i < len; ++i) {
+    buf[i] = next_elem();
+  }
+}
+
+static acc_t d0_bias[64][832] row_align_acc(1) = {0};
+static acc_t d1_bias[64][64] row_align_acc(1) = {0};
+
+static inline uint64_t read_cycles(void) {
+  uint64_t cycles;
+  asm volatile("rdcycle %0" : "=r"(cycles));
+  return cycles;
+}
+
+static MemRef2D_i8 make_memref_i8(elem_t *base, int64_t rows, int64_t cols) {
+  MemRef2D_i8 ref;
+  ref.basePtr = base;
+  ref.data = base;
+  ref.offset = 0;
+  ref.sizes[0] = rows;
+  ref.sizes[1] = cols;
+  ref.strides[1] = 1;
+  ref.strides[0] = cols;
+  return ref;
+}
+
+static MemRef2D_i32 make_memref_i32(acc_t *base, int64_t rows, int64_t cols) {
+  MemRef2D_i32 ref;
+  ref.basePtr = base;
+  ref.data = base;
+  ref.offset = 0;
+  ref.sizes[0] = rows;
+  ref.sizes[1] = cols;
+  ref.strides[1] = 1;
+  ref.strides[0] = cols;
+  return ref;
+}
+
+int main(void) {
+  lcg_state = 777;
+  init_random_i8(&input_mat[0][0], 
(int)(sizeof(input_mat) / sizeof(elem_t))); + init_random_i8(&weights0[0][0], (int)(sizeof(weights0) / sizeof(elem_t))); + init_random_i8(&weights1[0][0], (int)(sizeof(weights1) / sizeof(elem_t))); + + for (int i = 0; i < 64; ++i) { + for (int j = 0; j < 832; ++j) { + inter_results0[i][j] = 0; + d0_bias[i][j] = 0; + } + } + for (int i = 0; i < 64; ++i) { + for (int j = 0; j < 64; ++j) { + inter_results1[i][j] = 0; + d1_bias[i][j] = 0; + } + } + + MemRef2D_i8 a0_ref = make_memref_i8(&input_mat[0][0], 64, 832); + MemRef2D_i8 w0_ref = make_memref_i8(&weights0[0][0], 832, 832); + MemRef2D_i8 c0_ref = make_memref_i8(&inter_results0[0][0], 64, 832); + MemRef2D_i32 d0_ref = make_memref_i32(&d0_bias[0][0], 64, 832); + MemRef2D_i8 w1_ref = make_memref_i8(&weights1[0][0], 832, 64); + MemRef2D_i8 c1_ref = make_memref_i8(&inter_results1[0][0], 64, 64); + MemRef2D_i32 d1_ref = make_memref_i32(&d1_bias[0][0], 64, 64); + + gemmini_flush(0); + + uint64_t start = read_cycles(); + _mlir_ciface_mlp2(&a0_ref, &w0_ref, &c0_ref, &d0_ref, + &w1_ref, &c1_ref, &d1_ref); + gemmini_fence(); + uint64_t end = read_cycles(); + + printf("Buddy mlp2 cycles: %llu\n", (unsigned long long)(end - start)); + long long checksum = 0; + for (int i = 0; i < 64; ++i) { + for (int j = 0; j < 64; ++j) { + checksum += inter_results1[i][j]; + } + } + printf("Buddy mlp2 output checksum: %lld\n", checksum); + return 0; +} diff --git a/experiments/buddy-benchmarks/kernels/mlp2/mlp2-buddy.mlir b/experiments/buddy-benchmarks/kernels/mlp2/mlp2-buddy.mlir new file mode 100644 index 0000000..c513c73 --- /dev/null +++ b/experiments/buddy-benchmarks/kernels/mlp2/mlp2-buddy.mlir @@ -0,0 +1,15 @@ +module { + func.func @mlp2(%a0: memref<64x832xi8>, + %w0: memref<832x832xi8>, + %c0: memref<64x832xi8>, + %d0: memref<64x832xi32>, + %w1: memref<832x64xi8>, + %c1: memref<64x64xi8>, + %d1: memref<64x64xi32>) attributes { llvm.emit_c_interface } { + gemmini.tile_matmul %a0 %w0 %c0 %d0 {dataflow = 1, act = 1} : + memref<64x832xi8> 
memref<832x832xi8> memref<64x832xi8> memref<64x832xi32>
+    gemmini.tile_matmul %c0 %w1 %c1 %d1 {dataflow = 1, act = 1} :
+      memref<64x832xi8> memref<832x64xi8> memref<64x64xi8> memref<64x64xi32>
+    return
+  }
+}
diff --git a/experiments/buddy-benchmarks/kernels/softmax-matmul/softmax-matmul-buddy.c b/experiments/buddy-benchmarks/kernels/softmax-matmul/softmax-matmul-buddy.c
new file mode 100644
index 0000000..c83c179
--- /dev/null
+++ b/experiments/buddy-benchmarks/kernels/softmax-matmul/softmax-matmul-buddy.c
@@ -0,0 +1,121 @@
+#include <stdio.h>
+#include <stdlib.h>
+
+#include "include/gemmini.h"
+#include "include/gemmini_testutils.h"
+
+#define MAT_DIM_I 31
+#define MAT_DIM_K 30
+#define MAT_DIM_J 66
+
+typedef struct {
+  elem_t *basePtr;
+  elem_t *data;
+  int64_t offset;
+  int64_t sizes[2];
+  int64_t strides[2];
+} MemRef2D_i8;
+
+typedef struct {
+  acc_t *basePtr;
+  acc_t *data;
+  int64_t offset;
+  int64_t sizes[2];
+  int64_t strides[2];
+} MemRef2D_i32;
+
+extern void _mlir_ciface_softmax_matmul(MemRef2D_i8 *a, MemRef2D_i8 *b,
+                                        MemRef2D_i8 *c, MemRef2D_i32 *d);
+
+static MemRef2D_i8 make_memref_i8(elem_t *base, int64_t rows, int64_t cols) {
+  MemRef2D_i8 ref;
+  ref.basePtr = base;
+  ref.data = base;
+  ref.offset = 0;
+  ref.sizes[0] = rows;
+  ref.sizes[1] = cols;
+  ref.strides[1] = 1;
+  ref.strides[0] = cols;
+  return ref;
+}
+
+static MemRef2D_i32 make_memref_i32(acc_t *base, int64_t rows, int64_t cols) {
+  MemRef2D_i32 ref;
+  ref.basePtr = base;
+  ref.data = base;
+  ref.offset = 0;
+  ref.sizes[0] = rows;
+  ref.sizes[1] = cols;
+  ref.strides[1] = 1;
+  ref.strides[0] = cols;
+  return ref;
+}
+
+int main(void) {
+  static elem_t full_A[MAT_DIM_I][MAT_DIM_K] row_align(1);
+  static elem_t full_B[MAT_DIM_K][MAT_DIM_J] row_align(1);
+  static elem_t full_C[MAT_DIM_I][MAT_DIM_J] row_align(1);
+  static acc_t full_D[MAT_DIM_I][MAT_DIM_J] row_align_acc(1);
+
+  for (size_t i = 0; i < MAT_DIM_I; ++i) {
+    for (size_t j = 0; j < MAT_DIM_K; ++j) {
+      full_A[i][j] = (rand() % 7) - 3;
+    }
+  }
+
+  
for (size_t i = 0; i < MAT_DIM_K; ++i) { + for (size_t j = 0; j < MAT_DIM_J; ++j) { + full_B[i][j] = (rand() % 7) - 3; + } + } + + for (size_t i = 0; i < MAT_DIM_I; ++i) { + for (size_t j = 0; j < MAT_DIM_J; ++j) { + full_D[i][j] = 0; + } + } + + long long a_checksum = 0; + elem_t *a_ptr = &full_A[0][0]; + int a_elems = MAT_DIM_I * MAT_DIM_K; + for (int i = 0; i < a_elems; ++i) { + a_checksum += a_ptr[i]; + } + long long b_checksum = 0; + elem_t *b_ptr = &full_B[0][0]; + int b_elems = MAT_DIM_K * MAT_DIM_J; + for (int i = 0; i < b_elems; ++i) { + b_checksum += b_ptr[i]; + } + long long d_checksum = 0; + acc_t *d_ptr = &full_D[0][0]; + int d_elems = MAT_DIM_I * MAT_DIM_J; + for (int i = 0; i < d_elems; ++i) { + d_checksum += d_ptr[i]; + } + printf("A checksum: %lld\n", a_checksum); + printf("B checksum: %lld\n", b_checksum); + printf("D checksum: %lld\n", d_checksum); + + MemRef2D_i8 a_ref = make_memref_i8(&full_A[0][0], MAT_DIM_I, MAT_DIM_K); + MemRef2D_i8 b_ref = make_memref_i8(&full_B[0][0], MAT_DIM_K, MAT_DIM_J); + MemRef2D_i8 c_ref = make_memref_i8(&full_C[0][0], MAT_DIM_I, MAT_DIM_J); + MemRef2D_i32 d_ref = make_memref_i32(&full_D[0][0], MAT_DIM_I, MAT_DIM_J); + + gemmini_flush(0); + uint64_t start = read_cycles(); + _mlir_ciface_softmax_matmul(&a_ref, &b_ref, &c_ref, &d_ref); + gemmini_fence(); + uint64_t end = read_cycles(); + + printf("Buddy softmax matmul cycles: %llu\n", + (unsigned long long)(end - start)); + long long c_checksum = 0; + elem_t *c_ptr = &full_C[0][0]; + int c_elems = MAT_DIM_I * MAT_DIM_J; + for (int i = 0; i < c_elems; ++i) { + c_checksum += c_ptr[i]; + } + printf("Buddy output checksum: %lld\n", c_checksum); + return 0; +} diff --git a/experiments/buddy-benchmarks/kernels/softmax-matmul/softmax-matmul-buddy.mlir b/experiments/buddy-benchmarks/kernels/softmax-matmul/softmax-matmul-buddy.mlir new file mode 100644 index 0000000..d204086 --- /dev/null +++ b/experiments/buddy-benchmarks/kernels/softmax-matmul/softmax-matmul-buddy.mlir @@ 
-0,0 +1,10 @@ +module { + func.func @softmax_matmul(%a: memref<31x30xi8>, + %b: memref<30x66xi8>, + %c: memref<31x66xi8>, + %d: memref<31x66xi32>) attributes { llvm.emit_c_interface } { + gemmini.tile_matmul %a %b %c %d {act = 4, bertScale = 0.05:f32, dataflow = 1} : + memref<31x30xi8> memref<30x66xi8> memref<31x66xi8> memref<31x66xi32> + return + } +} diff --git a/experiments/buddy-benchmarks/logs/conv1-bad-buddy.log b/experiments/buddy-benchmarks/logs/conv1-bad-buddy.log new file mode 100644 index 0000000..a0d6247 --- /dev/null +++ b/experiments/buddy-benchmarks/logs/conv1-bad-buddy.log @@ -0,0 +1,9 @@ +=== ResNet50 Conv1 - BAD Buddy MLIR (INTENTIONAL WRONG STRIDE) === +This should produce WRONG checksum to verify our test methodology + +BAD Buddy conv1 cycles: 7082 +Output checksum: 89685778 +(This should NOT match the Gemmini C reference!) +=== BAD Conv1 DONE === +Gemmini extension configured with: + dim = 16 diff --git a/experiments/buddy-benchmarks/logs/conv1-buddy.log b/experiments/buddy-benchmarks/logs/conv1-buddy.log new file mode 100644 index 0000000..89a41db --- /dev/null +++ b/experiments/buddy-benchmarks/logs/conv1-buddy.log @@ -0,0 +1,14 @@ +=== ResNet50 Conv1 - Buddy MLIR === +Input: 4 x 224 x 224 x 3 +Kernel: 7 x 7, stride=2, padding=3 +Output (after pool): 4 x 56 x 56 x 64 +Input checksum: 3461497 +Weight checksum: -199 +Bias checksum: 110400 +Buddy conv1 cycles: 7313 +Output checksum: 10206332 +Output elements: 802816 +First 10 output values: 11 21 0 28 26 31 8 12 27 10 +=== Conv1 Buddy MLIR DONE === +Gemmini extension configured with: + dim = 16 diff --git a/experiments/buddy-benchmarks/logs/conv1-gemmini.log b/experiments/buddy-benchmarks/logs/conv1-gemmini.log new file mode 100644 index 0000000..bd2450a --- /dev/null +++ b/experiments/buddy-benchmarks/logs/conv1-gemmini.log @@ -0,0 +1,16 @@ +=== ResNet50 Conv1 - Gemmini C Reference === +Input: 4 x 224 x 224 x 3 +Kernel: 7 x 7, stride=2, padding=3 +Output (before pool): 4 x 112 x 112 x 64 +Pool: 
3 x 3, stride=2, padding=1 +Output (after pool): 4 x 56 x 56 x 64 +Input checksum: 3461497 +Weight checksum: -199 +Bias checksum: 110400 +Conv1 cycles: 225146 +Output checksum: 10206332 +Output elements: 802816 +First 10 output values: 11 21 0 28 26 31 8 12 27 10 +=== Conv1 Gemmini C Reference PASS === +Gemmini extension configured with: + dim = 16 diff --git a/experiments/buddy-benchmarks/resnet50/Makefile b/experiments/buddy-benchmarks/resnet50/Makefile new file mode 100644 index 0000000..00bff2c --- /dev/null +++ b/experiments/buddy-benchmarks/resnet50/Makefile @@ -0,0 +1,241 @@ +# Makefile for ResNet50 Gemmini vs Buddy-MLIR comparison +# +# Targets: +# conv1-gemmini-baremetal - Gemmini C reference (single layer) +# conv1-buddy-baremetal - Buddy MLIR (single layer) +# run-gemmini - Run Gemmini C on Spike +# run-buddy - Run Buddy on Spike +# compare - Run both and compare checksums + +# ============== Paths ============== +RISCV ?= /home/eecs/ashvin.verma/toolchains/riscv +BUDDY ?= /scratch/ashvin/buddy-mlir/build/bin +PK ?= /scratch/ashvin/riscv-pk/build/pk +SPIKE ?= $(RISCV)/bin/spike + +GEMMINI_ROOT := /scratch/ashvin/chipyard/generators/gemmini/software/gemmini-rocc-tests +BENCH_COMMON := $(GEMMINI_ROOT)/riscv-tests/benchmarks/common +GEMMINI_INCLUDE := $(GEMMINI_ROOT)/include +IMAGENET_DIR := $(GEMMINI_ROOT)/imagenet + +# ============== Compilers ============== +CC := $(RISCV)/bin/riscv64-unknown-elf-gcc + +# ============== Flags ============== +CFLAGS := \ + -DPREALLOCATE=1 \ + -DMULTITHREAD=1 \ + -mcmodel=medany \ + -std=gnu99 \ + -O2 \ + -ffast-math \ + -fno-common \ + -fno-builtin-printf \ + -fno-tree-loop-distribute-patterns \ + -march=rv64gc -Wa,-march=rv64gc \ + -I$(GEMMINI_ROOT)/riscv-tests \ + -I$(GEMMINI_ROOT)/riscv-tests/env \ + -I$(GEMMINI_ROOT) \ + -I$(BENCH_COMMON) \ + -I$(GEMMINI_INCLUDE) \ + -I$(IMAGENET_DIR) \ + -Wno-incompatible-pointer-types + +CFLAGS_BAREMETAL := \ + $(CFLAGS) \ + -nostdlib \ + -nostartfiles \ + -static \ + -T 
$(BENCH_COMMON)/test.ld \ + -DBAREMETAL=1 + +CFLAGS_PK := \ + $(CFLAGS) \ + -static \ + -DBAREMETAL=1 + +LIBS := -lm -lgcc + +# Benchmark common sources +BENCH_SRCS := $(wildcard $(BENCH_COMMON)/*.c) $(wildcard $(BENCH_COMMON)/*.S) + +# ============== Buddy MLIR passes ============== +BUDDY_OPT_FLAGS := \ + -lower-gemmini \ + -convert-scf-to-cf \ + -convert-arith-to-llvm \ + -convert-func-to-llvm \ + -llvm-legalize-for-export + +BUDDY_LLC_FLAGS := \ + -O3 \ + -filetype=obj \ + -mtriple=riscv64-unknown-elf \ + -mattr=+buddyext,+d,+f,+c \ + -float-abi=hard \ + -code-model=medium + +# ============== Targets ============== +.PHONY: all clean run-gemmini run-buddy run-bad compare validate conv2-validate + +all: conv1-gemmini-baremetal conv1-buddy-baremetal conv1-bad-buddy-baremetal + +conv2: conv2-gemmini-baremetal conv2-buddy-baremetal + +# ---- Gemmini C Reference ---- +conv1-gemmini-baremetal: conv1-gemmini.c + $(CC) $(CFLAGS_BAREMETAL) $< $(BENCH_SRCS) $(LIBS) -o $@ + +conv1-gemmini-pk: conv1-gemmini.c + $(CC) $(CFLAGS_PK) $< $(LIBS) -o $@ + +# ---- Buddy MLIR Path ---- +# Step 1: Lower MLIR to LLVM dialect +conv1-buddy.llvm.mlir: conv1-buddy.mlir + $(BUDDY)/buddy-opt $< $(BUDDY_OPT_FLAGS) -o $@ + +# Step 2: Translate to LLVM IR +conv1-buddy.ll: conv1-buddy.llvm.mlir + $(BUDDY)/buddy-translate $< --buddy-to-llvmir -o $@ + +# Step 3: Compile to object file +conv1-buddy.o: conv1-buddy.ll + $(BUDDY)/buddy-llc $(BUDDY_LLC_FLAGS) $< -o $@ + +# Step 4: Link with C harness (baremetal) +conv1-buddy-baremetal: conv1-buddy.c conv1-buddy.o + $(CC) $(CFLAGS_BAREMETAL) $< conv1-buddy.o $(BENCH_SRCS) $(LIBS) -o $@ + +# Step 4 (alternate): Link with C harness (pk) +conv1-buddy-pk: conv1-buddy.c conv1-buddy.o + $(CC) $(CFLAGS_PK) $< conv1-buddy.o $(LIBS) -o $@ + +# ---- BAD Buddy MLIR Path (intentionally wrong for validation) ---- +conv1-bad-buddy.llvm.mlir: conv1-bad-buddy.mlir + $(BUDDY)/buddy-opt $< $(BUDDY_OPT_FLAGS) -o $@ + +conv1-bad-buddy.ll: conv1-bad-buddy.llvm.mlir + 
$(BUDDY)/buddy-translate $< --buddy-to-llvmir -o $@ + +conv1-bad-buddy.o: conv1-bad-buddy.ll + $(BUDDY)/buddy-llc $(BUDDY_LLC_FLAGS) $< -o $@ + +conv1-bad-buddy-baremetal: conv1-bad-buddy.c conv1-bad-buddy.o + $(CC) $(CFLAGS_BAREMETAL) $< conv1-bad-buddy.o $(BENCH_SRCS) $(LIBS) -o $@ + +# ---- Conv2 (1x1 matmul) ---- +conv2-gemmini-baremetal: conv2-gemmini.c + $(CC) $(CFLAGS_BAREMETAL) $< $(BENCH_SRCS) $(LIBS) -o $@ + +conv2-buddy.llvm.mlir: conv2-buddy.mlir + $(BUDDY)/buddy-opt $< $(BUDDY_OPT_FLAGS) -o $@ + +conv2-buddy.ll: conv2-buddy.llvm.mlir + $(BUDDY)/buddy-translate $< --buddy-to-llvmir -o $@ + +conv2-buddy.o: conv2-buddy.ll + $(BUDDY)/buddy-llc $(BUDDY_LLC_FLAGS) $< -o $@ + +conv2-buddy-baremetal: conv2-buddy.c conv2-buddy.o + $(CC) $(CFLAGS_BAREMETAL) $< conv2-buddy.o $(BENCH_SRCS) $(LIBS) -o $@ + +# ============== Run targets ============== +run-gemmini: conv1-gemmini-baremetal + $(SPIKE) --extension=gemmini $< + +run-gemmini-pk: conv1-gemmini-pk + $(SPIKE) --extension=gemmini $(PK) $< + +run-buddy: conv1-buddy-baremetal + $(SPIKE) --extension=gemmini $< + +run-buddy-pk: conv1-buddy-pk + $(SPIKE) --extension=gemmini $(PK) $< + +run-bad: conv1-bad-buddy-baremetal + $(SPIKE) --extension=gemmini $< + +run-conv2-gemmini: conv2-gemmini-baremetal + $(SPIKE) --extension=gemmini $< + +run-conv2-buddy: conv2-buddy-baremetal + $(SPIKE) --extension=gemmini $< + +conv2-validate: conv2-gemmini-baremetal conv2-buddy-baremetal + @echo "========================================" + @echo " Conv2 (1x1 matmul) Validation " + @echo "========================================" + @echo "" + @echo "--- Gemmini C Reference ---" + @$(SPIKE) --extension=gemmini conv2-gemmini-baremetal 2>&1 | tee conv2-gemmini.log + @echo "" + @echo "--- Buddy MLIR ---" + @$(SPIKE) --extension=gemmini conv2-buddy-baremetal 2>&1 | tee conv2-buddy.log + @echo "" + @echo "=== Conv2 Comparison ===" + @GEMMINI_CKSUM=$$(grep 'Conv2 output checksum:' conv2-gemmini.log | awk '{print $$4}'); \ + 
+	BUDDY_CKSUM=$$(grep 'Conv2 output checksum:' conv2-buddy.log | awk '{print $$4}'); \
+	echo "Gemmini C checksum: $$GEMMINI_CKSUM"; \
+	echo "Buddy checksum: $$BUDDY_CKSUM"; \
+	if [ "$$GEMMINI_CKSUM" = "$$BUDDY_CKSUM" ]; then \
+	echo "[PASS] Conv2 checksums match"; \
+	else \
+	echo "[FAIL] Conv2 checksums do NOT match!"; \
+	fi
+
+compare: conv1-gemmini-baremetal conv1-buddy-baremetal
+	@echo "=== Running Gemmini C Reference ==="
+	@$(SPIKE) --extension=gemmini conv1-gemmini-baremetal 2>&1 | tee gemmini.log
+	@echo ""
+	@echo "=== Running Buddy MLIR ==="
+	@$(SPIKE) --extension=gemmini conv1-buddy-baremetal 2>&1 | tee buddy.log
+	@echo ""
+	@echo "=== Comparison ==="
+	@echo "Gemmini output checksum: $$(grep 'Output checksum' gemmini.log)"
+	@echo "Buddy output checksum: $$(grep 'Output checksum' buddy.log)"
+
+# Full validation including intentional failure case
+validate: conv1-gemmini-baremetal conv1-buddy-baremetal conv1-bad-buddy-baremetal
+	@echo "========================================"
+	@echo " Conv1 Validation Test Suite "
+	@echo "========================================"
+	@echo ""
+	@echo "--- Test 1: Gemmini C Reference ---"
+	@$(SPIKE) --extension=gemmini conv1-gemmini-baremetal 2>&1 | tee gemmini.log
+	@GEMMINI_CKSUM=$$(grep 'Output checksum:' gemmini.log | awk '{print $$3}'); \
+	echo "Reference checksum: $$GEMMINI_CKSUM" > validation_result.txt
+	@echo ""
+	@echo "--- Test 2: Buddy MLIR (correct) ---"
+	@$(SPIKE) --extension=gemmini conv1-buddy-baremetal 2>&1 | tee buddy.log
+	@echo ""
+	@echo "--- Test 3: Buddy MLIR (INTENTIONAL BAD - wrong stride) ---"
+	@$(SPIKE) --extension=gemmini conv1-bad-buddy-baremetal 2>&1 | tee bad.log
+	@echo ""
+	@echo "========================================"
+	@echo " VALIDATION RESULTS "
+	@echo "========================================"
+	@GEMMINI_CKSUM=$$(grep 'Output checksum:' gemmini.log | awk '{print $$3}'); \
+	BUDDY_CKSUM=$$(grep 'Output checksum:' buddy.log | awk '{print $$3}'); \
+	BAD_CKSUM=$$(grep 'Output checksum:' bad.log | awk '{print $$3}'); \
+	echo "Gemmini C reference checksum: $$GEMMINI_CKSUM"; \
+	echo "Buddy MLIR checksum: $$BUDDY_CKSUM"; \
+	echo "BAD Buddy checksum: $$BAD_CKSUM"; \
+	echo ""; \
+	if [ "$$GEMMINI_CKSUM" = "$$BUDDY_CKSUM" ]; then \
+	echo "[PASS] Buddy MLIR matches Gemmini C reference"; \
+	else \
+	echo "[FAIL] Buddy MLIR does NOT match Gemmini C reference!"; \
+	fi; \
+	if [ "$$GEMMINI_CKSUM" != "$$BAD_CKSUM" ]; then \
+	echo "[PASS] BAD test correctly produces different checksum (validation works)"; \
+	else \
+	echo "[FAIL] BAD test unexpectedly matches reference (validation broken!)"; \
+	fi
+
+# ============== Clean ==============
+clean:
+	rm -f *.o *.ll *.llvm.mlir *.log validation_result.txt
+	rm -f conv1-gemmini-baremetal conv1-gemmini-pk
+	rm -f conv1-buddy-baremetal conv1-buddy-pk
+	rm -f conv1-bad-buddy-baremetal
+	rm -f conv2-gemmini-baremetal conv2-buddy-baremetal
diff --git a/experiments/buddy-benchmarks/resnet50/conv1-bad-buddy.c b/experiments/buddy-benchmarks/resnet50/conv1-bad-buddy.c
new file mode 100644
index 0000000..2999c8a
--- /dev/null
+++ b/experiments/buddy-benchmarks/resnet50/conv1-bad-buddy.c
@@ -0,0 +1,135 @@
+// conv1-bad-buddy.c - C harness for INTENTIONALLY WRONG Buddy-MLIR conv1
+//
+// This tests a version with a wrong stride to verify that the checksum
+// validation can actually detect failures.
+
+#include <stdio.h>
+#include <string.h>
+#include <stdint.h>
+#include <stddef.h>
+
+#include "include/gemmini.h"
+#include "include/gemmini_nn.h"
+
+#include "resnet50_params.h"
+#include "images.h"
+
+typedef struct {
+  elem_t *basePtr;
+  elem_t *data;
+  int64_t offset;
+  int64_t sizes[4];
+  int64_t strides[4];
+} MemRef4D_i8;
+
+typedef struct {
+  elem_t *basePtr;
+  elem_t *data;
+  int64_t offset;
+  int64_t sizes[2];
+  int64_t strides[2];
+} MemRef2D_i8;
+
+typedef struct {
+  acc_t *basePtr;
+  acc_t *data;
+  int64_t offset;
+  int64_t sizes[1];
+  int64_t strides[1];
+} MemRef1D_i32;
+
+// External MLIR-compiled function (BAD version with wrong stride)
+extern void _mlir_ciface_conv1_bad(MemRef4D_i8 *input, MemRef2D_i8 *weights,
+                                   MemRef1D_i32 *bias, MemRef2D_i8 *output);
+
+static MemRef4D_i8 make_memref4_i8(elem_t *base, int64_t d0, int64_t d1,
+                                   int64_t d2, int64_t d3) {
+  MemRef4D_i8 ref;
+  ref.basePtr = base;
+  ref.data = base;
+  ref.offset = 0;
+  ref.sizes[0] = d0;
+  ref.sizes[1] = d1;
+  ref.sizes[2] = d2;
+  ref.sizes[3] = d3;
+  ref.strides[3] = 1;
+  ref.strides[2] = d3;
+  ref.strides[1] = d2 * d3;
+  ref.strides[0] = d1 * d2 * d3;
+  return ref;
+}
+
+static MemRef2D_i8 make_memref2_i8(elem_t *base, int64_t rows, int64_t cols) {
+  MemRef2D_i8 ref;
+  ref.basePtr = base;
+  ref.data = base;
+  ref.offset = 0;
+  ref.sizes[0] = rows;
+  ref.sizes[1] = cols;
+  ref.strides[1] = 1;
+  ref.strides[0] = cols;
+  return ref;
+}
+
+static MemRef1D_i32 make_memref1_i32(acc_t *base, int64_t len) {
+  MemRef1D_i32 ref;
+  ref.basePtr = base;
+  ref.data = base;
+  ref.offset = 0;
+  ref.sizes[0] = len;
+  ref.strides[0] = 1;
+  return ref;
+}
+
+#define POOL_OUT_ROW_DIM 56
+#define POOL_OUT_COL_DIM 56
+#define BATCH_SIZE 4
+#define OUT_CHANNELS 64
+#define PATCH_SIZE 147
+
+static elem_t buddy_output[BATCH_SIZE * POOL_OUT_ROW_DIM * POOL_OUT_COL_DIM][OUT_CHANNELS];
+
+int main(int argc, char *argv[]) {
+  gemmini_flush(0);
+
+  printf("=== ResNet50 Conv1 - BAD Buddy MLIR (INTENTIONAL WRONG STRIDE) ===\n");
+  printf("This should produce a WRONG checksum to verify our test methodology\n\n");
+
+  memset(buddy_output, 0, sizeof(buddy_output));
+
+  MemRef4D_i8 input_ref = make_memref4_i8(
+      (elem_t*)&images[0][0][0][0],
+      BATCH_SIZE, 224, 224, 3);
+
+  MemRef2D_i8 weights_ref = make_memref2_i8(
+      (elem_t*)&conv_1_w[0][0],
+      PATCH_SIZE, OUT_CHANNELS);
+
+  MemRef1D_i32 bias_ref = make_memref1_i32(
+      (acc_t*)&conv_1_b[0],
+      OUT_CHANNELS);
+
+  MemRef2D_i8 output_ref = make_memref2_i8(
+      &buddy_output[0][0],
+      BATCH_SIZE * POOL_OUT_ROW_DIM * POOL_OUT_COL_DIM,
+      OUT_CHANNELS);
+
+  uint64_t start = read_cycles();
+  _mlir_ciface_conv1_bad(&input_ref, &weights_ref, &bias_ref, &output_ref);
+  gemmini_fence();
+  uint64_t end = read_cycles();
+
+  printf("BAD Buddy conv1 cycles: %llu\n", (unsigned long long)(end - start));
+
+  long long output_checksum = 0;
+  int output_elems = BATCH_SIZE * POOL_OUT_ROW_DIM * POOL_OUT_COL_DIM * OUT_CHANNELS;
+  const elem_t *output_ptr = &buddy_output[0][0];
+  for (int i = 0; i < output_elems; i++) {
+    output_checksum += output_ptr[i];
+  }
+  printf("Output checksum: %lld\n", output_checksum);
+  printf("(This should NOT match the Gemmini C reference!)\n");
+
+  printf("=== BAD Conv1 DONE ===\n");
+
+  return 0;
+}
diff --git a/experiments/buddy-benchmarks/resnet50/conv1-bad-buddy.mlir b/experiments/buddy-benchmarks/resnet50/conv1-bad-buddy.mlir
new file mode 100644
index 0000000..dbd6121
--- /dev/null
+++ b/experiments/buddy-benchmarks/resnet50/conv1-bad-buddy.mlir
@@ -0,0 +1,25 @@
+// conv1-bad-buddy.mlir - INTENTIONALLY WRONG to validate checksum testing
+//
+// This uses WRONG parameters (stride=1 instead of stride=2) to verify
+// that our checksum comparison can detect failures.
+
+module {
+  func.func @conv1_bad(%input: memref<4x224x224x3xi8>,
+                       %weights: memref<147x64xi8>,
+                       %bias: memref<64xi32>,
+                       %output: memref<12544x64xi8>)
+      attributes { llvm.emit_c_interface } {
+    // WRONG: Using stride=1 instead of correct stride=2
+    // This should produce a completely different (wrong) output
+    %c112 = arith.constant 112 : i64
+    %c7 = arith.constant 7 : i64
+
+    // INTENTIONAL BUG: stride=1 (should be 2)
+    gemmini.tile_conv %input %weights %bias %output %c112 %c112 %c7
+        {stride = 1, inputDilation = 1, kernelDilation = 1, padding = 3,
+         act = 1, poolSize = 3, poolStride = 2, poolPadding = 1} :
+        memref<4x224x224x3xi8> memref<147x64xi8> memref<64xi32> memref<12544x64xi8>
+        i64 i64 i64
+    return
+  }
+}
diff --git a/experiments/buddy-benchmarks/resnet50/conv1-buddy.c b/experiments/buddy-benchmarks/resnet50/conv1-buddy.c
new file mode 100644
index 0000000..ab32b80
--- /dev/null
+++ b/experiments/buddy-benchmarks/resnet50/conv1-buddy.c
@@ -0,0 +1,190 @@
+// conv1-buddy.c - C harness for Buddy-MLIR ResNet50 conv_1 layer
+//
+// This harness:
+// 1. Includes the same resnet50_params.h weights as the Gemmini C reference
+// 2. Calls the Buddy-compiled conv1 function
+// 3. Computes checksums for validation against the Gemmini C reference
+
+#include <stdio.h>
+#include <string.h>
+#include <stdint.h>
+#include <stddef.h>
+
+#include "include/gemmini.h"
+#include "include/gemmini_nn.h"
+
+// Include the actual ResNet50 parameters (same weights as Gemmini C reference)
+#include "resnet50_params.h"
+#include "images.h"
+
+// Memref descriptor types for the MLIR C interface
+typedef struct {
+  elem_t *basePtr;
+  elem_t *data;
+  int64_t offset;
+  int64_t sizes[4];
+  int64_t strides[4];
+} MemRef4D_i8;
+
+typedef struct {
+  elem_t *basePtr;
+  elem_t *data;
+  int64_t offset;
+  int64_t sizes[2];
+  int64_t strides[2];
+} MemRef2D_i8;
+
+typedef struct {
+  acc_t *basePtr;
+  acc_t *data;
+  int64_t offset;
+  int64_t sizes[1];
+  int64_t strides[1];
+} MemRef1D_i32;
+
+// External MLIR-compiled function
+extern void _mlir_ciface_conv1(MemRef4D_i8 *input, MemRef2D_i8 *weights,
+                               MemRef1D_i32 *bias, MemRef2D_i8 *output);
+
+static MemRef4D_i8 make_memref4_i8(elem_t *base, int64_t d0, int64_t d1,
+                                   int64_t d2, int64_t d3) {
+  MemRef4D_i8 ref;
+  ref.basePtr = base;
+  ref.data = base;
+  ref.offset = 0;
+  ref.sizes[0] = d0;
+  ref.sizes[1] = d1;
+  ref.sizes[2] = d2;
+  ref.sizes[3] = d3;
+  ref.strides[3] = 1;
+  ref.strides[2] = d3;
+  ref.strides[1] = d2 * d3;
+  ref.strides[0] = d1 * d2 * d3;
+  return ref;
+}
+
+static MemRef2D_i8 make_memref2_i8(elem_t *base, int64_t rows, int64_t cols) {
+  MemRef2D_i8 ref;
+  ref.basePtr = base;
+  ref.data = base;
+  ref.offset = 0;
+  ref.sizes[0] = rows;
+  ref.sizes[1] = cols;
+  ref.strides[1] = 1;
+  ref.strides[0] = cols;
+  return ref;
+}
+
+static MemRef1D_i32 make_memref1_i32(acc_t *base, int64_t len) {
+  MemRef1D_i32 ref;
+  ref.basePtr = base;
+  ref.data = base;
+  ref.offset = 0;
+  ref.sizes[0] = len;
+  ref.strides[0] = 1;
+  return ref;
+}
+
+// Output buffer - must be static to avoid stack overflow
+// Shape: [batch * pool_out_row * pool_out_col][out_channels] = [12544][64]
+#define POOL_OUT_ROW_DIM 56
+#define POOL_OUT_COL_DIM 56
+#define BATCH_SIZE 4
+#define OUT_CHANNELS 64
+#define PATCH_SIZE 147 // 7*7*3
+
+static elem_t buddy_output[BATCH_SIZE * POOL_OUT_ROW_DIM * POOL_OUT_COL_DIM][OUT_CHANNELS];
+
+int main(int argc, char *argv[]) {
+  gemmini_flush(0);
+
+  printf("=== ResNet50 Conv1 - Buddy MLIR ===\n");
+  printf("Input: %d x %d x %d x %d\n",
+         conv_1_params.batch_size,
+         conv_1_params.in_row_dim,
+         conv_1_params.in_col_dim,
+         conv_1_params.in_channels);
+  printf("Kernel: %d x %d, stride=%d, padding=%d\n",
+         conv_1_params.kernel_size, conv_1_params.kernel_size,
+         conv_1_params.stride, conv_1_params.padding);
+  printf("Output (after pool): %d x %d x %d x %d\n",
+         conv_1_params.batch_size,
+         conv_1_params.out_dim_pooled, conv_1_params.out_dim_pooled,
+         conv_1_params.out_channels);
+
+  // Compute input checksum for verification
+  long long input_checksum = 0;
+  const elem_t *input_ptr = &images[0][0][0][0];
+  int input_elems = conv_1_params.batch_size * conv_1_params.in_row_dim *
+                    conv_1_params.in_col_dim * conv_1_params.in_channels;
+  for (int i = 0; i < input_elems; i++) {
+    input_checksum += input_ptr[i];
+  }
+  printf("Input checksum: %lld\n", input_checksum);
+
+  // Compute weight checksum
+  long long weight_checksum = 0;
+  const elem_t *weight_ptr = &conv_1_w[0][0];
+  int weight_elems = conv_1_params.patch_size * conv_1_params.out_channels;
+  for (int i = 0; i < weight_elems; i++) {
+    weight_checksum += weight_ptr[i];
+  }
+  printf("Weight checksum: %lld\n", weight_checksum);
+
+  // Compute bias checksum
+  long long bias_checksum = 0;
+  for (int i = 0; i < conv_1_params.out_channels; i++) {
+    bias_checksum += conv_1_b[i];
+  }
+  printf("Bias checksum: %lld\n", bias_checksum);
+
+  // Zero output buffer
+  memset(buddy_output, 0, sizeof(buddy_output));
+
+  // Create memref descriptors
+  MemRef4D_i8 input_ref = make_memref4_i8(
+      (elem_t*)&images[0][0][0][0],
+      BATCH_SIZE, 224, 224, 3);
+
+  MemRef2D_i8 weights_ref = make_memref2_i8(
+      (elem_t*)&conv_1_w[0][0],
+      PATCH_SIZE, OUT_CHANNELS);
+
+  MemRef1D_i32 bias_ref = make_memref1_i32(
+      (acc_t*)&conv_1_b[0],
+      OUT_CHANNELS);
+
+  MemRef2D_i8 output_ref = make_memref2_i8(
+      &buddy_output[0][0],
+      BATCH_SIZE * POOL_OUT_ROW_DIM * POOL_OUT_COL_DIM,
+      OUT_CHANNELS);
+
+  // Call Buddy-compiled conv1
+  uint64_t start = read_cycles();
+  _mlir_ciface_conv1(&input_ref, &weights_ref, &bias_ref, &output_ref);
+  gemmini_fence();
+  uint64_t end = read_cycles();
+
+  printf("Buddy conv1 cycles: %llu\n", (unsigned long long)(end - start));
+
+  // Compute output checksum
+  long long output_checksum = 0;
+  int output_elems = BATCH_SIZE * POOL_OUT_ROW_DIM * POOL_OUT_COL_DIM * OUT_CHANNELS;
+  const elem_t *output_ptr = &buddy_output[0][0];
+  for (int i = 0; i < output_elems; i++) {
+    output_checksum += output_ptr[i];
+  }
+  printf("Output checksum: %lld\n", output_checksum);
+  printf("Output elements: %d\n", output_elems);
+
+  // Print a few output values for debugging
+  printf("First 10 output values: ");
+  for (int i = 0; i < 10; i++) {
+    printf("%d ", output_ptr[i]);
+  }
+  printf("\n");
+
+  printf("=== Conv1 Buddy MLIR DONE ===\n");
+
+  return 0;
+}
diff --git a/experiments/buddy-benchmarks/resnet50/conv1-buddy.mlir b/experiments/buddy-benchmarks/resnet50/conv1-buddy.mlir
new file mode 100644
index 0000000..7f84ce4
--- /dev/null
+++ b/experiments/buddy-benchmarks/resnet50/conv1-buddy.mlir
@@ -0,0 +1,31 @@
+// conv1-buddy.mlir - Buddy MLIR for ResNet50 conv_1 layer
+//
+// Conv1 params: 7x7 conv, stride=2, padding=3, with 3x3 maxpool
+// Input: 4 x 224 x 224 x 3 (batch x height x width x channels)
+// Weights: 147 x 64 (patch_size=7*7*3 x out_channels)
+// Bias: 64
+// Output: 12544 x 64 (batch*pool_out_row*pool_out_col x out_channels)
+//       = 4*56*56 x 64
+
+module {
+  func.func @conv1(%input: memref<4x224x224x3xi8>,
+                   %weights: memref<147x64xi8>,
+                   %bias: memref<64xi32>,
+                   %output: memref<12544x64xi8>)
+      attributes { llvm.emit_c_interface } {
+    // out_row_dim and out_col_dim are BEFORE pooling
+    %c112 = arith.constant 112 : i64
+    %c7 = arith.constant 7 : i64
+
+    // gemmini.tile_conv: input weights bias output outRowDim outColDim kernelDim
+    // Attributes: stride, padding, act (1=ReLU), poolSize, poolStride, poolPadding
+    // scale = 1.0 / (1 << 8) = 0.00390625 (from conv_1_params.output_scale)
+    gemmini.tile_conv %input %weights %bias %output %c112 %c112 %c7
+        {stride = 2, inputDilation = 1, kernelDilation = 1, padding = 3,
+         act = 1, poolSize = 3, poolStride = 2, poolPadding = 1,
+         scale = 0.00390625 : f32} :
+        memref<4x224x224x3xi8> memref<147x64xi8> memref<64xi32> memref<12544x64xi8>
+        i64 i64 i64
+    return
+  }
+}
diff --git a/experiments/buddy-benchmarks/resnet50/conv1-gemmini.c b/experiments/buddy-benchmarks/resnet50/conv1-gemmini.c
new file mode 100644
index 0000000..03878a4
--- /dev/null
+++ b/experiments/buddy-benchmarks/resnet50/conv1-gemmini.c
@@ -0,0 +1,126 @@
+// conv1-gemmini.c - Standalone Gemmini C test for ResNet50 conv_1 layer
+// This creates a reference checksum for validation against Buddy-MLIR
+//
+// Conv1 params: 7x7 conv, stride=2, padding=3, with 3x3 maxpool
+// Input: 4 x 224 x 224 x 3 (batch x height x width x channels)
+// Output: 4 x 56 x 56 x 64 (after conv + pool)
+
+#include <stdio.h>
+#include <string.h>
+#include <stdint.h>
+#include <stdbool.h>
+
+#include "include/gemmini.h"
+#include "include/gemmini_nn.h"
+
+// Include the actual ResNet50 parameters (contains conv_1_w, conv_1_b, conv_1_params)
+#include "resnet50_params.h"
+#include "images.h"
+
+int main(int argc, char *argv[]) {
+  gemmini_flush(0);
+
+  enum tiled_matmul_type_t tiled_matmul_type = WS;
+
+  printf("=== ResNet50 Conv1 - Gemmini C Reference ===\n");
+  printf("Input: %d x %d x %d x %d\n",
+         conv_1_params.batch_size,
+         conv_1_params.in_row_dim,
+         conv_1_params.in_col_dim,
+         conv_1_params.in_channels);
+  printf("Kernel: %d x %d, stride=%d, padding=%d\n",
+         conv_1_params.kernel_size, conv_1_params.kernel_size,
+         conv_1_params.stride, conv_1_params.padding);
+  printf("Output (before pool): %d x %d x %d x %d\n",
+         conv_1_params.batch_size,
+         conv_1_params.out_row_dim, conv_1_params.out_col_dim,
+         conv_1_params.out_channels);
+  printf("Pool: %d x %d, stride=%d, padding=%d\n",
+         conv_1_params.pool_size, conv_1_params.pool_size,
+         conv_1_params.pool_stride, conv_1_params.pool_padding);
+  printf("Output (after pool): %d x %d x %d x %d\n",
+         conv_1_params.batch_size,
+         conv_1_params.out_dim_pooled, conv_1_params.out_dim_pooled,
+         conv_1_params.out_channels);
+
+  // Compute input checksum for verification
+  long long input_checksum = 0;
+  const elem_t *input_ptr = &images[0][0][0][0];
+  int input_elems = conv_1_params.batch_size * conv_1_params.in_row_dim *
+                    conv_1_params.in_col_dim * conv_1_params.in_channels;
+  for (int i = 0; i < input_elems; i++) {
+    input_checksum += input_ptr[i];
+  }
+  printf("Input checksum: %lld\n", input_checksum);
+
+  // Compute weight checksum
+  long long weight_checksum = 0;
+  const elem_t *weight_ptr = &conv_1_w[0][0];
+  int weight_elems = conv_1_params.patch_size * conv_1_params.out_channels;
+  for (int i = 0; i < weight_elems; i++) {
+    weight_checksum += weight_ptr[i];
+  }
+  printf("Weight checksum: %lld\n", weight_checksum);
+
+  // Compute bias checksum
+  long long bias_checksum = 0;
+  for (int i = 0; i < conv_1_params.out_channels; i++) {
+    bias_checksum += conv_1_b[i];
+  }
+  printf("Bias checksum: %lld\n", bias_checksum);
+
+  // Run conv_1 with tiled_conv_auto (fused conv + pool)
+  uint64_t start = read_cycles();
+
+  tiled_conv_auto(
+      conv_1_params.batch_size,
+      conv_1_params.in_row_dim, conv_1_params.in_col_dim,
+      conv_1_params.in_channels,
+      conv_1_params.out_channels,
+      conv_1_params.out_row_dim, conv_1_params.out_col_dim,
+      conv_1_params.stride,
+      1, // input_dilation
+      1, // kernel_dilation
+      conv_1_params.padding,
+      conv_1_params.kernel_size,
+      false, false, false, false, false, // transposes
+      (elem_t*)images,
+      (elem_t*)conv_1_w,
+      (acc_t*)conv_1_b,
+      (elem_t*)conv_1_out_pooled,
+      RELU,
+      conv_1_params.output_scale,
+      conv_1_params.pool_size,
+      conv_1_params.pool_stride,
+      conv_1_params.pool_padding,
+      tiled_matmul_type);
+
+  gemmini_fence();
+  uint64_t end = read_cycles();
+
+  printf("Conv1 cycles: %llu\n", (unsigned long long)(end - start));
+
+  // Compute output checksum
+  long long output_checksum = 0;
+  int output_elems = conv_1_params.batch_size *
+                     conv_1_params.out_dim_pooled *
+                     conv_1_params.out_dim_pooled *
+                     conv_1_params.out_channels;
+  const elem_t *output_ptr = &conv_1_out_pooled[0][0][0][0];
+  for (int i = 0; i < output_elems; i++) {
+    output_checksum += output_ptr[i];
+  }
+  printf("Output checksum: %lld\n", output_checksum);
+  printf("Output elements: %d\n", output_elems);
+
+  // Print a few output values for debugging
+  printf("First 10 output values: ");
+  for (int i = 0; i < 10; i++) {
+    printf("%d ", output_ptr[i]);
+  }
+  printf("\n");
+
+  printf("=== Conv1 Gemmini C Reference PASS ===\n");
+
+  return 0;
+}
diff --git a/experiments/buddy-benchmarks/scripts/run_benchmark.sh b/experiments/buddy-benchmarks/scripts/run_benchmark.sh
new file mode 100755
index 0000000..13e8b9d
--- /dev/null
+++ b/experiments/buddy-benchmarks/scripts/run_benchmark.sh
@@ -0,0 +1,165 @@
+#!/usr/bin/env bash
+# run_benchmark.sh - Build and run all Buddy-MLIR Gemmini benchmarks on Spike
+#
+# Usage: ./scripts/run_benchmark.sh
+#
+# Prerequisites:
+#   - RISCV, BUDDY, SPIKE env vars set (or defaults in Makefiles)
+#   - gemmini-rocc-tests available at expected path
+
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+ROOT_DIR="$(dirname "$SCRIPT_DIR")"
+
+PASS=0
+FAIL=0
+TOTAL=0
+
+# Colors for output
+RED='\033[0;31m'
+GREEN='\033[0;32m'
+BOLD='\033[1m'
+NC='\033[0m' # No Color
+
+log_header() {
+  echo ""
+  echo -e "${BOLD}========================================${NC}"
+  echo -e "${BOLD}  $1${NC}"
+  echo -e "${BOLD}========================================${NC}"
+  echo ""
+}
+
+log_result() {
+  local name="$1"
+  local status="$2"
+  local cycles="$3"
+  local checksum="$4"
+
+  TOTAL=$((TOTAL + 1))
+  if [ "$status" = "PASS" ]; then
+    PASS=$((PASS + 1))
+    echo -e "  ${GREEN}[PASS]${NC} $name cycles=$cycles checksum=$checksum"
+  else
+    FAIL=$((FAIL + 1))
+    echo -e "  ${RED}[FAIL]${NC} $name cycles=$cycles checksum=$checksum"
+  fi
+}
+
+# ============================================================
+# Step 1: Build kernel benchmarks
+# ============================================================
+log_header "Building kernel benchmarks"
+
+cd "$ROOT_DIR/kernels"
+make clean 2>/dev/null || true
+make all 2>&1 | tail -5
+echo "Kernel benchmarks built."
+
+# ============================================================
+# Step 2: Run kernel benchmarks on Spike
+# ============================================================
+log_header "Running kernel benchmarks on Spike"
+
+SPIKE="${SPIKE:-${RISCV:-/home/eecs/ashvin.verma/toolchains/riscv}/bin/spike}"
+
+declare -A EXPECTED_CHECKSUMS
+EXPECTED_CHECKSUMS[conv]=950
+EXPECTED_CHECKSUMS[conv-with-pool]=30827
+EXPECTED_CHECKSUMS[mlp2]=252338
+EXPECTED_CHECKSUMS[mlp2-os]=252338
+EXPECTED_CHECKSUMS[mlp1]=258664
+EXPECTED_CHECKSUMS[softmax-matmul]=3860
+EXPECTED_CHECKSUMS[igelu-matmul]=-23260
+
+for bench in conv conv-with-pool mlp2 mlp2-os mlp1 softmax-matmul igelu-matmul; do
+  if [ ! -f "${bench}-baremetal" ]; then
+    echo -e "  ${RED}[SKIP]${NC} $bench - binary not found"
+    continue
+  fi
+
+  OUTPUT=$($SPIKE --extension=gemmini "${bench}-baremetal" 2>&1) || true
+
+  # Extract cycles (look for "cycles:" in output)
+  CYCLES=$(echo "$OUTPUT" | grep -i 'cycles:' | grep -oP '\d+' | tail -1 || echo "N/A")
+
+  # Extract checksum (look for "output checksum:" in output)
+  CHECKSUM=$(echo "$OUTPUT" | grep -i 'output checksum:' | grep -oP '[-]?\d+' | tail -1 || echo "N/A")
+
+  EXPECTED="${EXPECTED_CHECKSUMS[$bench]:-UNKNOWN}"
+  if [ "$CHECKSUM" = "$EXPECTED" ]; then
+    log_result "$bench" "PASS" "$CYCLES" "$CHECKSUM"
+  else
+    log_result "$bench" "FAIL" "$CYCLES" "$CHECKSUM (expected $EXPECTED)"
+  fi
+done
+
+# ============================================================
+# Step 3: Build and run ResNet50 validation
+# ============================================================
+log_header "Building ResNet50 validation"
+
+cd "$ROOT_DIR/resnet50"
+make clean 2>/dev/null || true
+make all 2>&1 | tail -5
+echo "ResNet50 benchmarks built."
+
+log_header "Running ResNet50 validation on Spike"
+
+# Initialize so the later comparisons are safe under `set -u`
+# even if the reference binary is missing
+GEMMINI_CHECKSUM="N/A"
+
+# Run Gemmini C reference
+if [ -f "conv1-gemmini-baremetal" ]; then
+  OUTPUT=$($SPIKE --extension=gemmini conv1-gemmini-baremetal 2>&1) || true
+  GEMMINI_CYCLES=$(echo "$OUTPUT" | grep -i 'Conv1 cycles:' | grep -oP '\d+' | tail -1 || echo "N/A")
+  GEMMINI_CHECKSUM=$(echo "$OUTPUT" | grep -i 'Output checksum:' | grep -oP '[-]?\d+' | tail -1 || echo "N/A")
+  echo "  Gemmini C: cycles=$GEMMINI_CYCLES checksum=$GEMMINI_CHECKSUM"
+fi
+
+# Run Buddy
+if [ -f "conv1-buddy-baremetal" ]; then
+  OUTPUT=$($SPIKE --extension=gemmini conv1-buddy-baremetal 2>&1) || true
+  BUDDY_CYCLES=$(echo "$OUTPUT" | grep -i 'conv1 cycles:' | grep -oP '\d+' | tail -1 || echo "N/A")
+  BUDDY_CHECKSUM=$(echo "$OUTPUT" | grep -i 'Output checksum:' | grep -oP '[-]?\d+' | tail -1 || echo "N/A")
+
+  if [ "$BUDDY_CHECKSUM" = "$GEMMINI_CHECKSUM" ]; then
+    log_result "resnet50-conv1 (buddy)" "PASS" "$BUDDY_CYCLES" "$BUDDY_CHECKSUM"
+  else
+    log_result "resnet50-conv1 (buddy)" "FAIL" "$BUDDY_CYCLES" "$BUDDY_CHECKSUM (expected $GEMMINI_CHECKSUM)"
+  fi
+fi
+
+# Run BAD test (should NOT match)
+if [ -f "conv1-bad-buddy-baremetal" ]; then
+  OUTPUT=$($SPIKE --extension=gemmini conv1-bad-buddy-baremetal 2>&1) || true
+  BAD_CHECKSUM=$(echo "$OUTPUT" | grep -i 'Output checksum:' | grep -oP '[-]?\d+' | tail -1 || echo "N/A")
+
+  TOTAL=$((TOTAL + 1))
+  if [ "$BAD_CHECKSUM" != "$GEMMINI_CHECKSUM" ]; then
+    PASS=$((PASS + 1))
+    echo -e "  ${GREEN}[PASS]${NC} resnet50-conv1 (bad) correctly differs: checksum=$BAD_CHECKSUM"
+  else
+    FAIL=$((FAIL + 1))
+    echo -e "  ${RED}[FAIL]${NC} resnet50-conv1 (bad) unexpectedly matches reference!"
+  fi
+fi
+
+# ============================================================
+# Summary
+# ============================================================
+log_header "Summary"
+
+echo "  Total tests: $TOTAL"
+echo -e "  Passed: ${GREEN}$PASS${NC}"
+if [ "$FAIL" -gt 0 ]; then
+  echo -e "  Failed: ${RED}$FAIL${NC}"
+else
+  echo -e "  Failed: $FAIL"
+fi
+echo ""
+
+if [ "$FAIL" -gt 0 ]; then
+  echo -e "${RED}Some tests failed!${NC}"
+  exit 1
+else
+  echo -e "${GREEN}All tests passed.${NC}"
+  exit 0
+fi