diff --git a/experiments/buddy-benchmarks/README.md b/experiments/buddy-benchmarks/README.md new file mode 100644 index 0000000..038b974 --- /dev/null +++ b/experiments/buddy-benchmarks/README.md @@ -0,0 +1,149 @@ +# Buddy-MLIR Gemmini Performance Benchmarks + +Performance evaluation of [Buddy-MLIR](https://github.com/buddy-compiler/buddy-mlir)'s Gemmini dialect backend, +benchmarked against the Gemmini C reference implementation on Spike simulator. + +For the full lowering pipeline and setup instructions, see [WORKFLOW.md](WORKFLOW.md). + +## Performance Results + +### Matmul Workloads + +| Workload | Dataflow | Gemmini C cycles | Buddy cycles | Checksum Match | Speedup | +|----------|----------|------------------|--------------|----------------|---------| +| MLP2 (64x832) | WS | 2,528 | 409 | ✓ 252338 | 6.18x | +| MLP2 (64x832) | OS | 207,782 | 96,076 | ✓ 252338 | 2.16x | +| MLP1 (6-layer) | WS | 25,251 | 2,539 | ✓ 258664 | 9.95x | +| softmax matmul (31x30x66) | WS | 335 | 145 | ✓ 3860 | 2.31x | +| IGELU matmul (30x30x30) | WS | 133 | 133 | ✓ -23260 | 1.00x | + +### Conv Workloads + +After the conv-encoding fix ([buddy-compiler/buddy-mlir#689](https://github.com/buddy-compiler/buddy-mlir/pull/689)): + +| Workload | CPU cycles | Gemmini C cycles | Buddy cycles | Checksum Match | Buddy vs Gemmini C | +|----------|-----------|------------------|--------------|----------------|---------------------| +| conv (17x17, k=3, stride=2) | 7,559,913 | 1,027 | 149 | ✓ 950 | 6.89x | +| conv_with_pool (17x17, k=3, pool=3) | 7,714,291 | 1,605 | 172 | ✓ 30827 | 9.33x | + +### ResNet50 Layer Validation + +| Layer | Gemmini C cycles | Buddy cycles | Checksum Match | Speedup | +|-------|------------------|--------------|----------------|---------| +| Conv1 (7x7, stride=2, pool) | 225,146 | 7,313 | ✓ 10206332 | 30.8x | + +## Methodology + +- **Simulator**: Spike ISA simulator with Gemmini extension (`dim=16`) +- **Cycle measurement**: `rdcycle` instruction around the accelerator call 
(between `gemmini_flush(0)` and `gemmini_fence()`) +- **Validation**: Output checksums compared between Buddy-MLIR and Gemmini C reference +- **Gemmini C reference**: `tiled_matmul_auto` / `tiled_conv_auto` from + [gemmini-rocc-tests](https://github.com/ucb-bar/gemmini-rocc-tests) + +### Important Caveats + +On Spike, the `rdcycle` counter effectively measures **CPU instructions executed** (Spike retires one instruction per +cycle), not wall-clock time or Gemmini hardware execution time. Buddy-MLIR's compile-time loop unrolling reduces +host-side loop overhead (fewer `rdcycle` ticks for loop control), making the cycle +counts lower even when the underlying Gemmini hardware work is identical. + +The speedup numbers reflect reduced host-side orchestration overhead, not necessarily +faster accelerator throughput. + +### Buddy-MLIR Conv Encoding Fix + +The conv benchmarks require the fix from [buddy-compiler/buddy-mlir#689](https://github.com/buddy-compiler/buddy-mlir/pull/689), +which corrects the `im2col` encoding for convolutions in the Gemmini lowering path. +Without this fix, conv outputs produce incorrect checksums.
+ +## Directory Structure + +``` +experiments/buddy-benchmarks/ +├── README.md # This file +├── scripts/ +│ └── run_benchmark.sh # Run all benchmarks on Spike +├── kernels/ +│ ├── Makefile # Build all kernel benchmarks +│ ├── conv/ +│ │ ├── conv-buddy.mlir # 17x17 conv, k=3, stride=2 +│ │ └── conv-buddy.c # C harness +│ ├── conv-with-pool/ +│ │ ├── conv-with-pool-buddy.mlir # Conv + 3x3 maxpool +│ │ └── conv-with-pool-buddy.c +│ ├── mlp2/ +│ │ ├── mlp2-buddy.mlir # 2-layer MLP (WS) +│ │ ├── mlp2-buddy-os.mlir # 2-layer MLP (OS) +│ │ └── mlp2-buddy.c +│ ├── mlp1/ +│ │ ├── mlp1-buddy.mlir # 6-layer MLP +│ │ └── mlp1-buddy.c +│ ├── softmax-matmul/ +│ │ ├── softmax-matmul-buddy.mlir +│ │ └── softmax-matmul-buddy.c +│ └── igelu-matmul/ +│ ├── igelu-matmul-buddy.mlir +│ └── igelu-matmul-buddy.c +├── resnet50/ +│ ├── Makefile # Build + validate ResNet50 conv1 +│ ├── conv1-buddy.mlir # ResNet50 conv1 (7x7, stride=2, pool) +│ ├── conv1-buddy.c # Buddy C harness +│ ├── conv1-gemmini.c # Gemmini C reference +│ ├── conv1-bad-buddy.mlir # Intentional bad case (wrong stride) +│ └── conv1-bad-buddy.c +└── logs/ # Reference Spike output logs + ├── conv1-gemmini.log + ├── conv1-buddy.log + └── conv1-bad-buddy.log +``` + +## How to Reproduce + +### Prerequisites + +- RISC-V GNU toolchain (GCC cross-compiler for `riscv64-unknown-elf`) +- [Buddy-MLIR](https://github.com/buddy-compiler/buddy-mlir) built with Gemmini dialect + (`buddy-opt`, `buddy-translate`, `buddy-llc`) +- [Spike](https://github.com/riscv-software-src/riscv-isa-sim) ISA simulator with Gemmini extension +- [gemmini-rocc-tests](https://github.com/ucb-bar/gemmini-rocc-tests) (for headers and baremetal runtime) + +### Build and Run Kernel Benchmarks + +```bash +cd experiments/buddy-benchmarks/kernels + +# Set paths (adjust to your environment) +export RISCV=/path/to/riscv-toolchain +export BUDDY=/path/to/buddy-mlir/build/bin + +# Build all benchmarks +make all + +# Run all on Spike +make run-all + +# Or run individual 
benchmarks +make run-conv +make run-mlp2 +make run-mlp1 +``` + +### Build and Run ResNet50 Validation + +```bash +cd experiments/buddy-benchmarks/resnet50 + +# Build all (Gemmini C reference + Buddy + intentional bad case) +make all + +# Run full validation suite (compares checksums automatically) +make validate +``` + +### Run Everything + +```bash +cd experiments/buddy-benchmarks +./scripts/run_benchmark.sh +``` diff --git a/experiments/buddy-benchmarks/WORKFLOW.md b/experiments/buddy-benchmarks/WORKFLOW.md new file mode 100644 index 0000000..49a5b87 --- /dev/null +++ b/experiments/buddy-benchmarks/WORKFLOW.md @@ -0,0 +1,356 @@ +# Buddy-MLIR Gemmini Workflow: From MLIR to Execution on Spike + +This document describes the complete pipeline for compiling Gemmini dialect MLIR +to bare-metal RISC-V and running it on the Spike ISA simulator with the Gemmini +accelerator extension. + +## Pipeline Overview + +``` + ┌──────────────────────┐ + │ Gemmini MLIR Source │ + │ (gemmini.tile_*) │ + └──────────┬───────────┘ + │ + buddy-opt --lower-gemmini + + standard MLIR passes + │ + ▼ + ┌──────────────────────┐ + │ LLVM Dialect MLIR │ + │ (.llvm.mlir) │ + └──────────┬───────────┘ + │ + buddy-translate --buddy-to-llvmir + │ + ▼ + ┌──────────────────────┐ + │ LLVM IR (.ll) │ + └──────────┬───────────┘ + │ + buddy-llc -mattr=+buddyext + -mtriple=riscv64-unknown-elf + │ + ▼ + ┌──────────────────────┐ + │ RISC-V Object (.o) │ + │ (RoCC custom insns) │ + └──────────┬───────────┘ + │ + riscv64-unknown-elf-gcc + link with C harness + + baremetal runtime + │ + ▼ + ┌──────────────────────┐ + │ Bare-metal ELF │ + └──────────┬───────────┘ + │ + spike --extension=gemmini + │ + ▼ + ┌──────────────────────┐ + │ Gemmini Simulator │ + │ Output + Cycles │ + └──────────────────────┘ +``` + +## Prerequisites + +### 1. RISC-V GNU Toolchain + +A bare-metal cross-compiler targeting `riscv64-unknown-elf`: + +```bash +# Provides: riscv64-unknown-elf-gcc, as, ld, objdump, etc. 
+export RISCV=/path/to/riscv-toolchain +``` + +Build from source: https://github.com/riscv-collab/riscv-gnu-toolchain +```bash +./configure --prefix=$RISCV --with-arch=rv64gc --with-abi=lp64d +make +``` + +### 2. Spike ISA Simulator (with Gemmini extension) + +Spike must be built with Gemmini support from the Chipyard repository: + +```bash +# Clone chipyard (includes Gemmini as a generator) +git clone https://github.com/ucb-bar/chipyard.git +cd chipyard && ./scripts/init-submodules-no-riscv-tools.sh + +# Build Spike with Gemmini extension +cd sims/spike +make + +# Or use a pre-built spike if available: +export SPIKE=$RISCV/bin/spike +``` + +### 3. Gemmini ROCC Tests (headers + baremetal runtime) + +The C harnesses depend on headers and the bare-metal runtime from +[gemmini-rocc-tests](https://github.com/ucb-bar/gemmini-rocc-tests): + +```bash +export GEMMINI_ROOT=/path/to/chipyard/generators/gemmini/software/gemmini-rocc-tests +``` + +Key files used: +- `include/gemmini.h` — Gemmini C API (`tiled_matmul_auto`, `tiled_conv_auto`, RoCC instruction macros) +- `include/gemmini_params.h` — Hardware parameters (DIM=16, scratchpad/accumulator sizes) +- `include/gemmini_testutils.h` — Test utilities (`read_cycles`, checksum helpers) +- `include/gemmini_nn.h` — NN layer helpers (for ResNet50 reference) +- `riscv-tests/benchmarks/common/` — Bare-metal startup code (`_start`, printf shims, syscall stubs) +- `riscv-tests/benchmarks/common/test.ld` — Linker script for bare-metal execution + +### 4. 
Buddy-MLIR (with Gemmini dialect) + +Build Buddy-MLIR from source with Gemmini dialect support: + +```bash +# Step 1: Build LLVM/MLIR +git clone https://github.com/buddy-compiler/buddy-mlir.git +cd buddy-mlir && git submodule update --init +mkdir llvm/build && cd llvm/build +cmake -G Ninja ../llvm \ + -DLLVM_ENABLE_PROJECTS="mlir" \ + -DLLVM_TARGETS_TO_BUILD="host;RISCV" \ + -DCMAKE_BUILD_TYPE=Release +ninja + +# Step 2: Build buddy-mlir +cd ../../ +mkdir build && cd build +cmake -G Ninja .. \ + -DMLIR_DIR=$PWD/../llvm/build/lib/cmake/mlir \ + -DLLVM_DIR=$PWD/../llvm/build/lib/cmake/llvm \ + -DCMAKE_BUILD_TYPE=Release +ninja buddy-opt buddy-translate buddy-llc + +export BUDDY=$PWD/bin +``` + +**Important:** Conv benchmarks require the fix from +[buddy-compiler/buddy-mlir#689](https://github.com/buddy-compiler/buddy-mlir/pull/689) +which corrects the `im2col` encoding in the Gemmini conv lowering. + +## Step-by-Step: Compiling a Gemmini MLIR Kernel + +Using `conv-buddy.mlir` as an example: + +### Step 1: Write the Gemmini dialect MLIR + +```mlir +// conv-buddy.mlir +module { + func.func @conv(%input: memref<2x17x17x18xi8>, + %weights: memref<162x19xi8>, + %bias: memref<19xi32>, + %output: memref<162x19xi8>) attributes { llvm.emit_c_interface } { + %c9 = arith.constant 9 : i64 + %c3 = arith.constant 3 : i64 + gemmini.tile_conv %input %weights %bias %output %c9 %c9 %c3 + {stride = 2, inputDilation = 1, kernelDilation = 1, padding = 1, + act = 0} : + memref<2x17x17x18xi8> memref<162x19xi8> memref<19xi32> memref<162x19xi8> + i64 i64 i64 + return + } +} +``` + +The `llvm.emit_c_interface` attribute generates a `_mlir_ciface_conv` wrapper +callable from C with memref descriptor structs. 
+ +Key `gemmini.tile_conv` operands: +- `%input` — 4D input tensor (batch × height × width × channels) +- `%weights` — 2D flattened weight matrix (patch_size × out_channels) +- `%bias` — 1D bias vector +- `%output` — 2D output matrix (n_patches × out_channels) +- `%c9 %c9` — output row/col dimensions (before pooling) +- `%c3` — kernel dimension + +Key attributes: `stride`, `padding`, `act` (0=none, 1=ReLU, 3=iGELU, 4=softmax), +`poolSize`, `poolStride`, `poolPadding`, `dataflow` (0=OS, 1=WS), `bertScale`. + +### Step 2: Lower to LLVM dialect + +```bash +buddy-opt conv-buddy.mlir \ + -lower-gemmini \ + -convert-scf-to-cf \ + -convert-arith-to-llvm \ + -convert-func-to-llvm \ + -llvm-legalize-for-export \ + -o conv-buddy.llvm.mlir +``` + +Pass breakdown: +| Pass | What it does | +|------|-------------| +| `-lower-gemmini` | `gemmini.tile_conv` → Gemmini intrinsics (`gemmini.intr.loop_conv_ws`, `gemmini.intr.config_ex`, `gemmini.intr.flush`, etc.) with pre-computed tile sizes and constant offsets | +| `-convert-scf-to-cf` | SCF control flow → branch-based control flow | +| `-convert-arith-to-llvm` | Arithmetic ops → LLVM dialect | +| `-convert-func-to-llvm` | Function signatures → LLVM calling convention | +| `-llvm-legalize-for-export` | Final cleanup for LLVM IR emission | + +### Step 3: Translate to LLVM IR + +```bash +buddy-translate conv-buddy.llvm.mlir --buddy-to-llvmir -o conv-buddy.ll +``` + +This produces standard LLVM IR with inline assembly for Gemmini's RoCC custom +instructions (encoded as `.insn r` directives for the RISC-V assembler). 
+ +### Step 4: Compile to RISC-V object + +```bash +buddy-llc conv-buddy.ll \ + -O3 \ + -filetype=obj \ + -mtriple=riscv64-unknown-elf \ + -mattr=+buddyext,+d,+f,+c \ + -float-abi=hard \ + -code-model=medium \ + -o conv-buddy.o +``` + +Key flags: +| Flag | Why | +|------|-----| +| `-mattr=+buddyext` | Enables custom Gemmini RoCC instruction support | +| `-mattr=+d,+f,+c` | Double/float/compressed RISC-V extensions | +| `-code-model=medium` | Required for large models (avoids `R_RISCV_HI20` relocation overflow) | +| `-float-abi=hard` | Hardware floating-point ABI | + +### Step 5: Write a C harness + +The C harness provides `main()`, initializes inputs, and calls the MLIR-generated +function through the C interface: + +```c +#include "include/gemmini.h" +#include "include/gemmini_testutils.h" + +// Memref descriptor matching MLIR's C interface +typedef struct { + elem_t *basePtr; + elem_t *data; + int64_t offset; + int64_t sizes[4]; + int64_t strides[4]; +} MemRef4D_i8; + +// The MLIR-generated function (from llvm.emit_c_interface) +extern void _mlir_ciface_conv(MemRef4D_i8 *input, ...); + +int main(void) { + // Initialize inputs, call function, measure cycles with rdcycle + gemmini_flush(0); + uint64_t start = read_cycles(); + _mlir_ciface_conv(&input_ref, &weights_ref, &bias_ref, &output_ref); + gemmini_fence(); + uint64_t end = read_cycles(); + printf("Cycles: %llu\n", (unsigned long long)(end - start)); + // Compute and print output checksum for validation +} +``` + +### Step 6: Link into bare-metal ELF + +```bash +riscv64-unknown-elf-gcc \ + -DPREALLOCATE=1 -DMULTITHREAD=1 -DBAREMETAL=1 \ + -mcmodel=medany -std=gnu99 -O2 -ffast-math \ + -fno-common -fno-builtin-printf \ + -fno-tree-loop-distribute-patterns \ + -march=rv64gc -Wa,-march=rv64gc \ + -nostdlib -nostartfiles -static \ + -T $GEMMINI_ROOT/riscv-tests/benchmarks/common/test.ld \ + -I$GEMMINI_ROOT/riscv-tests -I$GEMMINI_ROOT/riscv-tests/env \ + -I$GEMMINI_ROOT -I$GEMMINI_ROOT/include \ + 
-I$GEMMINI_ROOT/riscv-tests/benchmarks/common \ + conv-buddy.c conv-buddy.o \ + $GEMMINI_ROOT/riscv-tests/benchmarks/common/*.c \ + $GEMMINI_ROOT/riscv-tests/benchmarks/common/*.S \ + -lm -lgcc \ + -o conv-baremetal +``` + +The bare-metal runtime from `benchmarks/common/` provides: +- `_start` entry point and C runtime initialization +- `printf` via HTIF (Host-Target Interface) syscalls +- Memory management stubs + +### Step 7: Run on Spike + +```bash +spike --extension=gemmini conv-baremetal +``` + +Spike simulates the RISC-V core with the Gemmini systolic array extension +(default config: 16×16 PEs, weight-stationary dataflow). The `--extension=gemmini` +flag loads the Gemmini functional model that intercepts RoCC custom instructions. + +Example output: +``` +Buddy conv cycles: 149 +Buddy conv output checksum: 950 +Gemmini extension configured with: + dim = 16 +``` + +## Using the Makefiles + +Instead of running each step manually, use the provided Makefiles: + +```bash +# Kernel benchmarks (conv, mlp, etc.) +cd experiments/buddy-benchmarks/kernels +make conv-baremetal # Build one benchmark +make all # Build all benchmarks +make run-conv # Build + run on Spike +make run-all # Run everything + +# ResNet50 layer validation +cd experiments/buddy-benchmarks/resnet50 +make all # Build Gemmini C ref + Buddy + bad case +make validate # Run all three and compare checksums + +# Or run the full suite: +cd experiments/buddy-benchmarks +./scripts/run_benchmark.sh +``` + +## Why Buddy-MLIR Shows Fewer Cycles + +The `rdcycle` instruction counts **CPU instructions executed**, not Gemmini +hardware cycles. Buddy-MLIR's lowering pre-computes tile sizes, loop bounds, and +memory offsets at compile time, emitting a flat sequence of Gemmini intrinsic calls +with constant arguments. 
In contrast, the Gemmini C reference (`tiled_matmul_auto`) +performs runtime tile-size search, per-tile address arithmetic, and loop iteration — +all of which execute on the CPU and inflate the `rdcycle` count. + +The underlying Gemmini hardware work (systolic array compute, DMA transfers) is +the same in both cases. The speedup reflects reduced **host-side orchestration +overhead**, not faster accelerator throughput. This advantage would still manifest +on real hardware, since the CPU is freed up sooner for other work. + +## Gemmini MLIR Operations Reference + +| Operation | Description | Key Attributes | +|-----------|-------------|----------------| +| `gemmini.tile_matmul` | Tiled matrix multiply | `dataflow` (0=OS, 1=WS), `act` (0/1/3/4) | +| `gemmini.tile_conv` | Tiled convolution (im2col) | `stride`, `padding`, `poolSize`, `poolStride`, `act` | +| `gemmini.intr.flush` | Flush Gemmini command queue | — | +| `gemmini.intr.config_ex` | Configure execution mode | dataflow, activation, scale | +| `gemmini.intr.loop_ws` | Weight-stationary matmul loop | tile dimensions, addresses | +| `gemmini.intr.loop_conv_ws` | Weight-stationary conv loop | conv parameters, addresses | + +Activation functions: 0=none, 1=ReLU, 3=iGELU, 4=softmax + +Dataflows: 0=output-stationary (accumulates in place), 1=weight-stationary (keeps weights in scratchpad) diff --git a/experiments/buddy-benchmarks/kernels/Makefile b/experiments/buddy-benchmarks/kernels/Makefile new file mode 100644 index 0000000..c54f737 --- /dev/null +++ b/experiments/buddy-benchmarks/kernels/Makefile @@ -0,0 +1,225 @@ +# Makefile for Buddy-MLIR Gemmini kernel benchmarks +# +# Builds MLIR kernels through buddy-opt -> buddy-translate -> buddy-llc, +# then links with C harnesses against gemmini-rocc-tests baremetal runtime. 
# +# Targets: +#   all                - Build all kernel benchmarks +#   run-all            - Run all on Spike +#   &lt;bench&gt;-baremetal  - Build a specific benchmark +#   run-&lt;bench&gt;        - Run a specific benchmark on Spike +#   clean              - Remove build artifacts + +# ============== Paths ============== +RISCV ?= /home/eecs/ashvin.verma/toolchains/riscv +BUDDY ?= /scratch/ashvin/buddy-mlir/build/bin +PK ?= /scratch/ashvin/riscv-pk/build/pk +SPIKE ?= $(RISCV)/bin/spike + +GEMMINI_ROOT := /scratch/ashvin/chipyard/generators/gemmini/software/gemmini-rocc-tests +BENCH_COMMON := $(GEMMINI_ROOT)/riscv-tests/benchmarks/common +GEMMINI_INCLUDE := $(GEMMINI_ROOT)/include +MLP_DIR := $(GEMMINI_ROOT)/mlps + +# ============== Compilers ============== +CC := $(RISCV)/bin/riscv64-unknown-elf-gcc + +# ============== Flags ============== +CFLAGS := \ + -DPREALLOCATE=1 \ + -DMULTITHREAD=1 \ + -mcmodel=medany \ + -std=gnu99 \ + -O2 \ + -ffast-math \ + -fno-common \ + -fno-builtin-printf \ + -fno-tree-loop-distribute-patterns \ + -march=rv64gc -Wa,-march=rv64gc \ + -I$(GEMMINI_ROOT)/riscv-tests \ + -I$(GEMMINI_ROOT)/riscv-tests/env \ + -I$(GEMMINI_ROOT) \ + -I$(BENCH_COMMON) \ + -I$(GEMMINI_INCLUDE) \ + -I$(MLP_DIR) \ + -Wno-incompatible-pointer-types + +CFLAGS_BAREMETAL := \ + $(CFLAGS) \ + -nostdlib \ + -nostartfiles \ + -static \ + -T $(BENCH_COMMON)/test.ld \ + -DBAREMETAL=1 + +LIBS := -lm -lgcc + +# Benchmark common sources +BENCH_SRCS := $(wildcard $(BENCH_COMMON)/*.c) $(wildcard $(BENCH_COMMON)/*.S) + +# ============== Buddy MLIR passes ============== +BUDDY_OPT_FLAGS := \ + -lower-gemmini \ + -convert-scf-to-cf \ + -convert-arith-to-llvm \ + -convert-func-to-llvm \ + -llvm-legalize-for-export + +BUDDY_LLC_FLAGS := \ + -O3 \ + -filetype=obj \ + -mtriple=riscv64-unknown-elf \ + -mattr=+buddyext,+d,+f,+c \ + -float-abi=hard \ + -code-model=medium + +# ============== Benchmark definitions ============== +# Each benchmark: (name, mlir-dir, mlir-file, c-file, func-name) +BENCHMARKS := conv conv-with-pool mlp2 mlp2-os mlp1 softmax-matmul
igelu-matmul + +# Build directory for intermediate artifacts +BUILD := build + +# ============== Targets ============== +.PHONY: all clean run-all $(addprefix run-,$(BENCHMARKS)) + +all: $(addsuffix -baremetal,$(BENCHMARKS)) + +# ---- Generic MLIR compilation rules ---- +# Pattern: build/&lt;name&gt;.ll and build/&lt;name&gt;.o from build/&lt;name&gt;.llvm.mlir. +# There is no generic rule for build/&lt;name&gt;.llvm.mlir: the source .mlir files +# live in per-benchmark subdirectories, so each benchmark defines its own +# buddy-opt rule below. +$(BUILD)/%.ll: $(BUILD)/%.llvm.mlir + $(BUDDY)/buddy-translate $< --buddy-to-llvmir -o $@ + +$(BUILD)/%.o: $(BUILD)/%.ll + $(BUDDY)/buddy-llc $(BUDDY_LLC_FLAGS) $< -o $@ + +$(BUILD): + mkdir -p $(BUILD) + +# ---- conv ---- +$(BUILD)/conv-buddy.llvm.mlir: conv/conv-buddy.mlir | $(BUILD) + $(BUDDY)/buddy-opt $< $(BUDDY_OPT_FLAGS) -o $@ + +$(BUILD)/conv-buddy.ll: $(BUILD)/conv-buddy.llvm.mlir + $(BUDDY)/buddy-translate $< --buddy-to-llvmir -o $@ + +$(BUILD)/conv-buddy.o: $(BUILD)/conv-buddy.ll + $(BUDDY)/buddy-llc $(BUDDY_LLC_FLAGS) $< -o $@ + +conv-baremetal: conv/conv-buddy.c $(BUILD)/conv-buddy.o + $(CC) $(CFLAGS_BAREMETAL) $< $(BUILD)/conv-buddy.o $(BENCH_SRCS) $(LIBS) -o $@ + +# ---- conv-with-pool ---- +$(BUILD)/conv-with-pool-buddy.llvm.mlir: conv-with-pool/conv-with-pool-buddy.mlir | $(BUILD) + $(BUDDY)/buddy-opt $< $(BUDDY_OPT_FLAGS) -o $@ + +$(BUILD)/conv-with-pool-buddy.ll: $(BUILD)/conv-with-pool-buddy.llvm.mlir + $(BUDDY)/buddy-translate $< --buddy-to-llvmir -o $@ + +$(BUILD)/conv-with-pool-buddy.o: $(BUILD)/conv-with-pool-buddy.ll + $(BUDDY)/buddy-llc $(BUDDY_LLC_FLAGS) $< -o $@ + +conv-with-pool-baremetal: conv-with-pool/conv-with-pool-buddy.c $(BUILD)/conv-with-pool-buddy.o + $(CC) $(CFLAGS_BAREMETAL) $< $(BUILD)/conv-with-pool-buddy.o $(BENCH_SRCS) $(LIBS) -o $@ + +# ---- mlp2 (weight-stationary) ---- +$(BUILD)/mlp2-buddy.llvm.mlir: mlp2/mlp2-buddy.mlir | $(BUILD) + $(BUDDY)/buddy-opt $< $(BUDDY_OPT_FLAGS) -o $@ + +$(BUILD)/mlp2-buddy.ll: $(BUILD)/mlp2-buddy.llvm.mlir + $(BUDDY)/buddy-translate $< --buddy-to-llvmir -o $@ + +$(BUILD)/mlp2-buddy.o:
$(BUILD)/mlp2-buddy.ll + $(BUDDY)/buddy-llc $(BUDDY_LLC_FLAGS) $< -o $@ + +mlp2-baremetal: mlp2/mlp2-buddy.c $(BUILD)/mlp2-buddy.o + $(CC) $(CFLAGS_BAREMETAL) $< $(BUILD)/mlp2-buddy.o $(BENCH_SRCS) $(LIBS) -o $@ + +# ---- mlp2-os (output-stationary) ---- +$(BUILD)/mlp2-buddy-os.llvm.mlir: mlp2/mlp2-buddy-os.mlir | $(BUILD) + $(BUDDY)/buddy-opt $< $(BUDDY_OPT_FLAGS) -o $@ + +$(BUILD)/mlp2-buddy-os.ll: $(BUILD)/mlp2-buddy-os.llvm.mlir + $(BUDDY)/buddy-translate $< --buddy-to-llvmir -o $@ + +$(BUILD)/mlp2-buddy-os.o: $(BUILD)/mlp2-buddy-os.ll + $(BUDDY)/buddy-llc $(BUDDY_LLC_FLAGS) $< -o $@ + +mlp2-os-baremetal: mlp2/mlp2-buddy.c $(BUILD)/mlp2-buddy-os.o + $(CC) $(CFLAGS_BAREMETAL) $< $(BUILD)/mlp2-buddy-os.o $(BENCH_SRCS) $(LIBS) -o $@ + +# ---- mlp1 (6-layer) ---- +$(BUILD)/mlp1-buddy.llvm.mlir: mlp1/mlp1-buddy.mlir | $(BUILD) + $(BUDDY)/buddy-opt $< $(BUDDY_OPT_FLAGS) -o $@ + +$(BUILD)/mlp1-buddy.ll: $(BUILD)/mlp1-buddy.llvm.mlir + $(BUDDY)/buddy-translate $< --buddy-to-llvmir -o $@ + +$(BUILD)/mlp1-buddy.o: $(BUILD)/mlp1-buddy.ll + $(BUDDY)/buddy-llc $(BUDDY_LLC_FLAGS) $< -o $@ + +mlp1-baremetal: mlp1/mlp1-buddy.c $(BUILD)/mlp1-buddy.o + $(CC) $(CFLAGS_BAREMETAL) $< $(BUILD)/mlp1-buddy.o $(BENCH_SRCS) $(LIBS) -o $@ + +# ---- softmax-matmul ---- +$(BUILD)/softmax-matmul-buddy.llvm.mlir: softmax-matmul/softmax-matmul-buddy.mlir | $(BUILD) + $(BUDDY)/buddy-opt $< $(BUDDY_OPT_FLAGS) -o $@ + +$(BUILD)/softmax-matmul-buddy.ll: $(BUILD)/softmax-matmul-buddy.llvm.mlir + $(BUDDY)/buddy-translate $< --buddy-to-llvmir -o $@ + +$(BUILD)/softmax-matmul-buddy.o: $(BUILD)/softmax-matmul-buddy.ll + $(BUDDY)/buddy-llc $(BUDDY_LLC_FLAGS) $< -o $@ + +softmax-matmul-baremetal: softmax-matmul/softmax-matmul-buddy.c $(BUILD)/softmax-matmul-buddy.o + $(CC) $(CFLAGS_BAREMETAL) $< $(BUILD)/softmax-matmul-buddy.o $(BENCH_SRCS) $(LIBS) -o $@ + +# ---- igelu-matmul ---- +$(BUILD)/igelu-matmul-buddy.llvm.mlir: igelu-matmul/igelu-matmul-buddy.mlir | $(BUILD) + $(BUDDY)/buddy-opt $< 
$(BUDDY_OPT_FLAGS) -o $@ + +$(BUILD)/igelu-matmul-buddy.ll: $(BUILD)/igelu-matmul-buddy.llvm.mlir + $(BUDDY)/buddy-translate $< --buddy-to-llvmir -o $@ + +$(BUILD)/igelu-matmul-buddy.o: $(BUILD)/igelu-matmul-buddy.ll + $(BUDDY)/buddy-llc $(BUDDY_LLC_FLAGS) $< -o $@ + +igelu-matmul-baremetal: igelu-matmul/igelu-matmul-buddy.c $(BUILD)/igelu-matmul-buddy.o + $(CC) $(CFLAGS_BAREMETAL) $< $(BUILD)/igelu-matmul-buddy.o $(BENCH_SRCS) $(LIBS) -o $@ + +# ============== Run targets ============== +run-conv: conv-baremetal + $(SPIKE) --extension=gemmini $< + +run-conv-with-pool: conv-with-pool-baremetal + $(SPIKE) --extension=gemmini $< + +run-mlp2: mlp2-baremetal + $(SPIKE) --extension=gemmini $< + +run-mlp2-os: mlp2-os-baremetal + $(SPIKE) --extension=gemmini $< + +run-mlp1: mlp1-baremetal + $(SPIKE) --extension=gemmini $< + +run-softmax-matmul: softmax-matmul-baremetal + $(SPIKE) --extension=gemmini $< + +run-igelu-matmul: igelu-matmul-baremetal + $(SPIKE) --extension=gemmini $< + +run-all: $(addsuffix -baremetal,$(BENCHMARKS)) + @for bench in $(BENCHMARKS); do \ + echo "=== Running $$bench ==="; \ + $(SPIKE) --extension=gemmini $${bench}-baremetal 2>&1; \ + echo ""; \ + done + +# ============== Clean ============== +clean: + rm -rf $(BUILD) + rm -f $(addsuffix -baremetal,$(BENCHMARKS)) diff --git a/experiments/buddy-benchmarks/kernels/conv-with-pool/conv-with-pool-buddy.c b/experiments/buddy-benchmarks/kernels/conv-with-pool/conv-with-pool-buddy.c new file mode 100644 index 0000000..12e3ecc --- /dev/null +++ b/experiments/buddy-benchmarks/kernels/conv-with-pool/conv-with-pool-buddy.c @@ -0,0 +1,186 @@ +#include <stdio.h> +#include <stdint.h> +#include <stdlib.h> +#include <assert.h> + +#include "include/gemmini.h" +#include "include/gemmini_testutils.h" + +#define IN_ROW_DIM 17 +#define IN_COL_DIM 17 +#define IN_CHANNELS 18 +#define OUT_CHANNELS 19 +#define BATCH_SIZE 2 +#define KERNEL_DIM 3 +#define PADDING 1 +#define STRIDE 2 + +#define POOL_SIZE 3 +#define POOL_STRIDE 2 +#define POOL_PADDING 1 + +#define
OUT_ROW_DIM ((IN_ROW_DIM + 2 * PADDING - KERNEL_DIM) / STRIDE + 1) +#define OUT_COL_DIM ((IN_COL_DIM + 2 * PADDING - KERNEL_DIM) / STRIDE + 1) +#define PATCH_SIZE (KERNEL_DIM * KERNEL_DIM * IN_CHANNELS) +#define N_PATCHES (BATCH_SIZE * OUT_ROW_DIM * OUT_COL_DIM) + +#define POOL_OUT_ROW_DIM ((OUT_ROW_DIM + 2 * POOL_PADDING - POOL_SIZE) / POOL_STRIDE + 1) +#define POOL_OUT_COL_DIM ((OUT_COL_DIM + 2 * POOL_PADDING - POOL_SIZE) / POOL_STRIDE + 1) + +typedef struct { + elem_t *basePtr; + elem_t *data; + int64_t offset; + int64_t sizes[4]; + int64_t strides[4]; +} MemRef4D_i8; + +typedef struct { + elem_t *basePtr; + elem_t *data; + int64_t offset; + int64_t sizes[2]; + int64_t strides[2]; +} MemRef2D_i8; + +typedef struct { + acc_t *basePtr; + acc_t *data; + int64_t offset; + int64_t sizes[1]; + int64_t strides[1]; +} MemRef1D_i32; + +extern void _mlir_ciface_conv_with_pool(MemRef4D_i8 *input, MemRef2D_i8 *weights, + MemRef1D_i32 *bias, MemRef2D_i8 *output); + +static MemRef4D_i8 make_memref4_i8(elem_t *base, int64_t d0, int64_t d1, + int64_t d2, int64_t d3) { + MemRef4D_i8 ref; + ref.basePtr = base; + ref.data = base; + ref.offset = 0; + ref.sizes[0] = d0; + ref.sizes[1] = d1; + ref.sizes[2] = d2; + ref.sizes[3] = d3; + ref.strides[3] = 1; + ref.strides[2] = d3; + ref.strides[1] = d2 * d3; + ref.strides[0] = d1 * d2 * d3; + return ref; +} + +static MemRef2D_i8 make_memref2_i8(elem_t *base, int64_t rows, int64_t cols) { + MemRef2D_i8 ref; + ref.basePtr = base; + ref.data = base; + ref.offset = 0; + ref.sizes[0] = rows; + ref.sizes[1] = cols; + ref.strides[1] = 1; + ref.strides[0] = cols; + return ref; +} + +static MemRef1D_i32 make_memref1_i32(acc_t *base, int64_t len) { + MemRef1D_i32 ref; + ref.basePtr = base; + ref.data = base; + ref.offset = 0; + ref.sizes[0] = len; + ref.strides[0] = 1; + return ref; +} + +static void init_random(elem_t *buf, int len) { + for (elem_t *ptr = buf; ptr < buf + len; ptr++) { + *ptr = (rand() % 5) - 2; + } +} + +static void 
init_random_acc(acc_t *buf, int len) { + for (acc_t *ptr = buf; ptr < buf + len; ptr++) { + *ptr = (rand() % 5) - 2; + } +} + +static void flatten_weights(int out_channels, int kernel_dim, int in_channels, + int patch_size, + elem_t weights[out_channels][kernel_dim][kernel_dim][in_channels], + elem_t weights_mat[patch_size][out_channels]) { + assert(patch_size == kernel_dim * kernel_dim * in_channels); + for (int outc = 0; outc < out_channels; outc++) { + for (int krow = 0; krow < kernel_dim; krow++) { + for (int kcol = 0; kcol < kernel_dim; kcol++) { + for (int inc = 0; inc < in_channels; inc++) { + int wmatrow = krow * kernel_dim * in_channels + + kcol * in_channels + inc; + weights_mat[wmatrow][outc] = weights[outc][krow][kcol][inc]; + } + } + } + } +} + +int main(void) { + static elem_t input[BATCH_SIZE][IN_ROW_DIM][IN_COL_DIM][IN_CHANNELS]; + static elem_t weights[OUT_CHANNELS][KERNEL_DIM][KERNEL_DIM][IN_CHANNELS]; + static acc_t bias[OUT_CHANNELS]; + static elem_t weights_mat[PATCH_SIZE][OUT_CHANNELS]; + static elem_t pool_output_mat[BATCH_SIZE * POOL_OUT_ROW_DIM * POOL_OUT_COL_DIM][OUT_CHANNELS]; + + init_random(&input[0][0][0][0], sizeof(input) / sizeof(elem_t)); + init_random(&weights[0][0][0][0], sizeof(weights) / sizeof(elem_t)); + init_random_acc(&bias[0], sizeof(bias) / sizeof(acc_t)); + flatten_weights(OUT_CHANNELS, KERNEL_DIM, IN_CHANNELS, PATCH_SIZE, + weights, weights_mat); + + long long input_checksum = 0; + elem_t *input_ptr = &input[0][0][0][0]; + int input_elems = BATCH_SIZE * IN_ROW_DIM * IN_COL_DIM * IN_CHANNELS; + for (int i = 0; i < input_elems; ++i) { + input_checksum += input_ptr[i]; + } + long long weight_checksum = 0; + elem_t *weight_ptr = &weights[0][0][0][0]; + int weight_elems = OUT_CHANNELS * KERNEL_DIM * KERNEL_DIM * IN_CHANNELS; + for (int i = 0; i < weight_elems; ++i) { + weight_checksum += weight_ptr[i]; + } + long long bias_checksum = 0; + for (int i = 0; i < OUT_CHANNELS; ++i) { + bias_checksum += bias[i]; + } + printf("Input 
checksum: %lld\n", input_checksum); + printf("Weights checksum: %lld\n", weight_checksum); + printf("Bias checksum: %lld\n", bias_checksum); + + MemRef4D_i8 input_ref = + make_memref4_i8(&input[0][0][0][0], BATCH_SIZE, IN_ROW_DIM, IN_COL_DIM, + IN_CHANNELS); + MemRef2D_i8 weights_ref = + make_memref2_i8(&weights_mat[0][0], PATCH_SIZE, OUT_CHANNELS); + MemRef1D_i32 bias_ref = make_memref1_i32(&bias[0], OUT_CHANNELS); + MemRef2D_i8 output_ref = + make_memref2_i8(&pool_output_mat[0][0], + BATCH_SIZE * POOL_OUT_ROW_DIM * POOL_OUT_COL_DIM, + OUT_CHANNELS); + + gemmini_flush(0); + uint64_t start = read_cycles(); + _mlir_ciface_conv_with_pool(&input_ref, &weights_ref, &bias_ref, &output_ref); + gemmini_fence(); + uint64_t end = read_cycles(); + + printf("Buddy conv_with_pool cycles: %llu\n", + (unsigned long long)(end - start)); + long long checksum = 0; + for (int i = 0; i < BATCH_SIZE * POOL_OUT_ROW_DIM * POOL_OUT_COL_DIM; ++i) { + for (int j = 0; j < OUT_CHANNELS; ++j) { + checksum += pool_output_mat[i][j]; + } + } + printf("Buddy conv_with_pool output checksum: %lld\n", checksum); + return 0; +} diff --git a/experiments/buddy-benchmarks/kernels/conv-with-pool/conv-with-pool-buddy.mlir b/experiments/buddy-benchmarks/kernels/conv-with-pool/conv-with-pool-buddy.mlir new file mode 100644 index 0000000..15383e4 --- /dev/null +++ b/experiments/buddy-benchmarks/kernels/conv-with-pool/conv-with-pool-buddy.mlir @@ -0,0 +1,15 @@ +module { + func.func @conv_with_pool(%input: memref<2x17x17x18xi8>, + %weights: memref<162x19xi8>, + %bias: memref<19xi32>, + %output: memref<50x19xi8>) attributes { llvm.emit_c_interface } { + %c9 = arith.constant 9 : i64 + %c3 = arith.constant 3 : i64 + gemmini.tile_conv %input %weights %bias %output %c9 %c9 %c3 + {stride = 2, inputDilation = 1, kernelDilation = 1, padding = 1, + act = 0, poolSize = 3, poolStride = 2, poolPadding = 1} : + memref<2x17x17x18xi8> memref<162x19xi8> memref<19xi32> memref<50x19xi8> + i64 i64 i64 + return + } +} diff --git 
a/experiments/buddy-benchmarks/kernels/conv/conv-buddy.c b/experiments/buddy-benchmarks/kernels/conv/conv-buddy.c new file mode 100644 index 0000000..4646503 --- /dev/null +++ b/experiments/buddy-benchmarks/kernels/conv/conv-buddy.c @@ -0,0 +1,177 @@ +#include <stdio.h> +#include <stdint.h> +#include <stdlib.h> +#include <assert.h> + +#include "include/gemmini.h" +#include "include/gemmini_testutils.h" + +#define IN_ROW_DIM 17 +#define IN_COL_DIM 17 +#define IN_CHANNELS 18 +#define OUT_CHANNELS 19 +#define BATCH_SIZE 2 +#define KERNEL_DIM 3 +#define PADDING 1 +#define STRIDE 2 + +#define OUT_ROW_DIM ((IN_ROW_DIM + 2 * PADDING - KERNEL_DIM) / STRIDE + 1) +#define OUT_COL_DIM ((IN_COL_DIM + 2 * PADDING - KERNEL_DIM) / STRIDE + 1) +#define PATCH_SIZE (KERNEL_DIM * KERNEL_DIM * IN_CHANNELS) +#define N_PATCHES (BATCH_SIZE * OUT_ROW_DIM * OUT_COL_DIM) + +typedef struct { + elem_t *basePtr; + elem_t *data; + int64_t offset; + int64_t sizes[4]; + int64_t strides[4]; +} MemRef4D_i8; + +typedef struct { + elem_t *basePtr; + elem_t *data; + int64_t offset; + int64_t sizes[2]; + int64_t strides[2]; +} MemRef2D_i8; + +typedef struct { + acc_t *basePtr; + acc_t *data; + int64_t offset; + int64_t sizes[1]; + int64_t strides[1]; +} MemRef1D_i32; + +extern void _mlir_ciface_conv(MemRef4D_i8 *input, MemRef2D_i8 *weights, + MemRef1D_i32 *bias, MemRef2D_i8 *output); + +static MemRef4D_i8 make_memref4_i8(elem_t *base, int64_t d0, int64_t d1, + int64_t d2, int64_t d3) { + MemRef4D_i8 ref; + ref.basePtr = base; + ref.data = base; + ref.offset = 0; + ref.sizes[0] = d0; + ref.sizes[1] = d1; + ref.sizes[2] = d2; + ref.sizes[3] = d3; + ref.strides[3] = 1; + ref.strides[2] = d3; + ref.strides[1] = d2 * d3; + ref.strides[0] = d1 * d2 * d3; + return ref; +} + +static MemRef2D_i8 make_memref2_i8(elem_t *base, int64_t rows, int64_t cols) { + MemRef2D_i8 ref; + ref.basePtr = base; + ref.data = base; + ref.offset = 0; + ref.sizes[0] = rows; + ref.sizes[1] = cols; + ref.strides[1] = 1; + ref.strides[0] = cols; + return ref; +} + +static
MemRef1D_i32 make_memref1_i32(acc_t *base, int64_t len) { + MemRef1D_i32 ref; + ref.basePtr = base; + ref.data = base; + ref.offset = 0; + ref.sizes[0] = len; + ref.strides[0] = 1; + return ref; +} + +static void init_random(elem_t *buf, int len) { + for (elem_t *ptr = buf; ptr < buf + len; ptr++) { + *ptr = (rand() % 5) - 2; + } +} + +static void init_random_acc(acc_t *buf, int len) { + for (acc_t *ptr = buf; ptr < buf + len; ptr++) { + *ptr = (rand() % 5) - 2; + } +} + +static void flatten_weights(int out_channels, int kernel_dim, int in_channels, + int patch_size, + elem_t weights[out_channels][kernel_dim][kernel_dim][in_channels], + elem_t weights_mat[patch_size][out_channels]) { + assert(patch_size == kernel_dim * kernel_dim * in_channels); + for (int outc = 0; outc < out_channels; outc++) { + for (int krow = 0; krow < kernel_dim; krow++) { + for (int kcol = 0; kcol < kernel_dim; kcol++) { + for (int inc = 0; inc < in_channels; inc++) { + int wmatrow = krow * kernel_dim * in_channels + + kcol * in_channels + inc; + weights_mat[wmatrow][outc] = weights[outc][krow][kcol][inc]; + } + } + } + } +} + +int main(void) { + static elem_t input[BATCH_SIZE][IN_ROW_DIM][IN_COL_DIM][IN_CHANNELS]; + static elem_t weights[OUT_CHANNELS][KERNEL_DIM][KERNEL_DIM][IN_CHANNELS]; + static acc_t bias[OUT_CHANNELS]; + static elem_t weights_mat[PATCH_SIZE][OUT_CHANNELS]; + static elem_t output_mat[N_PATCHES][OUT_CHANNELS]; + + init_random(&input[0][0][0][0], sizeof(input) / sizeof(elem_t)); + init_random(&weights[0][0][0][0], sizeof(weights) / sizeof(elem_t)); + init_random_acc(&bias[0], sizeof(bias) / sizeof(acc_t)); + flatten_weights(OUT_CHANNELS, KERNEL_DIM, IN_CHANNELS, PATCH_SIZE, + weights, weights_mat); + + long long input_checksum = 0; + elem_t *input_ptr = &input[0][0][0][0]; + int input_elems = BATCH_SIZE * IN_ROW_DIM * IN_COL_DIM * IN_CHANNELS; + for (int i = 0; i < input_elems; ++i) { + input_checksum += input_ptr[i]; + } + long long weight_checksum = 0; + elem_t 
*weight_ptr = &weights[0][0][0][0]; + int weight_elems = OUT_CHANNELS * KERNEL_DIM * KERNEL_DIM * IN_CHANNELS; + for (int i = 0; i < weight_elems; ++i) { + weight_checksum += weight_ptr[i]; + } + long long bias_checksum = 0; + for (int i = 0; i < OUT_CHANNELS; ++i) { + bias_checksum += bias[i]; + } + printf("Input checksum: %lld\n", input_checksum); + printf("Weights checksum: %lld\n", weight_checksum); + printf("Bias checksum: %lld\n", bias_checksum); + + MemRef4D_i8 input_ref = + make_memref4_i8(&input[0][0][0][0], BATCH_SIZE, IN_ROW_DIM, IN_COL_DIM, + IN_CHANNELS); + MemRef2D_i8 weights_ref = + make_memref2_i8(&weights_mat[0][0], PATCH_SIZE, OUT_CHANNELS); + MemRef1D_i32 bias_ref = make_memref1_i32(&bias[0], OUT_CHANNELS); + MemRef2D_i8 output_ref = + make_memref2_i8(&output_mat[0][0], N_PATCHES, OUT_CHANNELS); + + gemmini_flush(0); + uint64_t start = read_cycles(); + _mlir_ciface_conv(&input_ref, &weights_ref, &bias_ref, &output_ref); + gemmini_fence(); + uint64_t end = read_cycles(); + + printf("Buddy conv cycles: %llu\n", + (unsigned long long)(end - start)); + long long checksum = 0; + for (int i = 0; i < N_PATCHES; ++i) { + for (int j = 0; j < OUT_CHANNELS; ++j) { + checksum += output_mat[i][j]; + } + } + printf("Buddy conv output checksum: %lld\n", checksum); + return 0; +} diff --git a/experiments/buddy-benchmarks/kernels/conv/conv-buddy.mlir b/experiments/buddy-benchmarks/kernels/conv/conv-buddy.mlir new file mode 100644 index 0000000..ba91a8d --- /dev/null +++ b/experiments/buddy-benchmarks/kernels/conv/conv-buddy.mlir @@ -0,0 +1,15 @@ +module { + func.func @conv(%input: memref<2x17x17x18xi8>, + %weights: memref<162x19xi8>, + %bias: memref<19xi32>, + %output: memref<162x19xi8>) attributes { llvm.emit_c_interface } { + %c9 = arith.constant 9 : i64 + %c3 = arith.constant 3 : i64 + gemmini.tile_conv %input %weights %bias %output %c9 %c9 %c3 + {stride = 2, inputDilation = 1, kernelDilation = 1, padding = 1, + act = 0} : + memref<2x17x17x18xi8> 
memref<162x19xi8> memref<19xi32> memref<162x19xi8>
+    i64 i64 i64
+    return
+  }
+}
diff --git a/experiments/buddy-benchmarks/kernels/igelu-matmul/igelu-matmul-buddy.c b/experiments/buddy-benchmarks/kernels/igelu-matmul/igelu-matmul-buddy.c
new file mode 100644
index 0000000..07832f0
--- /dev/null
+++ b/experiments/buddy-benchmarks/kernels/igelu-matmul/igelu-matmul-buddy.c
@@ -0,0 +1,121 @@
+#include <stdio.h>
+#include <stdlib.h>
+
+#include "include/gemmini.h"
+#include "include/gemmini_testutils.h"
+
+#define MAT_DIM_I 30
+#define MAT_DIM_K 30
+#define MAT_DIM_J 30
+
+typedef struct {
+  elem_t *basePtr;
+  elem_t *data;
+  int64_t offset;
+  int64_t sizes[2];
+  int64_t strides[2];
+} MemRef2D_i8;
+
+typedef struct {
+  acc_t *basePtr;
+  acc_t *data;
+  int64_t offset;
+  int64_t sizes[2];
+  int64_t strides[2];
+} MemRef2D_i32;
+
+extern void _mlir_ciface_igelu_matmul(MemRef2D_i8 *a, MemRef2D_i8 *b,
+                                      MemRef2D_i8 *c, MemRef2D_i32 *d);
+
+static MemRef2D_i8 make_memref_i8(elem_t *base, int64_t rows, int64_t cols) {
+  MemRef2D_i8 ref;
+  ref.basePtr = base;
+  ref.data = base;
+  ref.offset = 0;
+  ref.sizes[0] = rows;
+  ref.sizes[1] = cols;
+  ref.strides[1] = 1;
+  ref.strides[0] = cols;
+  return ref;
+}
+
+static MemRef2D_i32 make_memref_i32(acc_t *base, int64_t rows, int64_t cols) {
+  MemRef2D_i32 ref;
+  ref.basePtr = base;
+  ref.data = base;
+  ref.offset = 0;
+  ref.sizes[0] = rows;
+  ref.sizes[1] = cols;
+  ref.strides[1] = 1;
+  ref.strides[0] = cols;
+  return ref;
+}
+
+int main(void) {
+  static elem_t full_A[MAT_DIM_I][MAT_DIM_K] row_align(1);
+  static elem_t full_B[MAT_DIM_K][MAT_DIM_J] row_align(1);
+  static elem_t full_C[MAT_DIM_I][MAT_DIM_J] row_align(1);
+  static acc_t full_D[MAT_DIM_I][MAT_DIM_J] row_align_acc(1);
+
+  for (size_t i = 0; i < MAT_DIM_I; ++i) {
+    for (size_t j = 0; j < MAT_DIM_K; ++j) {
+      full_A[i][j] = (rand() % 3) - 1;
+    }
+  }
+
+  for (size_t i = 0; i < MAT_DIM_K; ++i) {
+    for (size_t j = 0; j < MAT_DIM_J; ++j) {
+      full_B[i][j] = (rand() % 3) - 1;
+    }
+  }
+
+  for (size_t i 
= 0; i < MAT_DIM_I; ++i) { + for (size_t j = 0; j < MAT_DIM_J; ++j) { + full_D[i][j] = 0; + } + } + + long long a_checksum = 0; + elem_t *a_ptr = &full_A[0][0]; + int a_elems = MAT_DIM_I * MAT_DIM_K; + for (int i = 0; i < a_elems; ++i) { + a_checksum += a_ptr[i]; + } + long long b_checksum = 0; + elem_t *b_ptr = &full_B[0][0]; + int b_elems = MAT_DIM_K * MAT_DIM_J; + for (int i = 0; i < b_elems; ++i) { + b_checksum += b_ptr[i]; + } + long long d_checksum = 0; + acc_t *d_ptr = &full_D[0][0]; + int d_elems = MAT_DIM_I * MAT_DIM_J; + for (int i = 0; i < d_elems; ++i) { + d_checksum += d_ptr[i]; + } + printf("A checksum: %lld\n", a_checksum); + printf("B checksum: %lld\n", b_checksum); + printf("D checksum: %lld\n", d_checksum); + + MemRef2D_i8 a_ref = make_memref_i8(&full_A[0][0], MAT_DIM_I, MAT_DIM_K); + MemRef2D_i8 b_ref = make_memref_i8(&full_B[0][0], MAT_DIM_K, MAT_DIM_J); + MemRef2D_i8 c_ref = make_memref_i8(&full_C[0][0], MAT_DIM_I, MAT_DIM_J); + MemRef2D_i32 d_ref = make_memref_i32(&full_D[0][0], MAT_DIM_I, MAT_DIM_J); + + gemmini_flush(0); + uint64_t start = read_cycles(); + _mlir_ciface_igelu_matmul(&a_ref, &b_ref, &c_ref, &d_ref); + gemmini_fence(); + uint64_t end = read_cycles(); + + printf("Buddy igelu matmul cycles: %llu\n", + (unsigned long long)(end - start)); + long long c_checksum = 0; + elem_t *c_ptr = &full_C[0][0]; + int c_elems = MAT_DIM_I * MAT_DIM_J; + for (int i = 0; i < c_elems; ++i) { + c_checksum += c_ptr[i]; + } + printf("Buddy output checksum: %lld\n", c_checksum); + return 0; +} diff --git a/experiments/buddy-benchmarks/kernels/igelu-matmul/igelu-matmul-buddy.mlir b/experiments/buddy-benchmarks/kernels/igelu-matmul/igelu-matmul-buddy.mlir new file mode 100644 index 0000000..74dea9f --- /dev/null +++ b/experiments/buddy-benchmarks/kernels/igelu-matmul/igelu-matmul-buddy.mlir @@ -0,0 +1,10 @@ +module { + func.func @igelu_matmul(%a: memref<30x30xi8>, + %b: memref<30x30xi8>, + %c: memref<30x30xi8>, + %d: memref<30x30xi32>) attributes { 
llvm.emit_c_interface } {
+    gemmini.tile_matmul %a %b %c %d {act = 3, bertScale = 0.8:f32, dataflow = 1} :
+      memref<30x30xi8> memref<30x30xi8> memref<30x30xi8> memref<30x30xi32>
+    return
+  }
+}
diff --git a/experiments/buddy-benchmarks/kernels/mlp1/mlp1-buddy.c b/experiments/buddy-benchmarks/kernels/mlp1/mlp1-buddy.c
new file mode 100644
index 0000000..4a6630d
--- /dev/null
+++ b/experiments/buddy-benchmarks/kernels/mlp1/mlp1-buddy.c
@@ -0,0 +1,153 @@
+#include <stdint.h>
+#include <stdio.h>
+#include <string.h>
+
+#include "include/gemmini.h"
+#include "parameters1.h"
+
+typedef struct {
+  elem_t *basePtr;
+  elem_t *data;
+  int64_t offset;
+  int64_t sizes[2];
+  int64_t strides[2];
+} MemRef2D_i8;
+
+typedef struct {
+  acc_t *basePtr;
+  acc_t *data;
+  int64_t offset;
+  int64_t sizes[2];
+  int64_t strides[2];
+} MemRef2D_i32;
+
+extern void _mlir_ciface_mlp1(MemRef2D_i8 *a0, MemRef2D_i8 *w0,
+                              MemRef2D_i8 *c0, MemRef2D_i32 *d0,
+                              MemRef2D_i8 *w1, MemRef2D_i8 *c1,
+                              MemRef2D_i32 *d1, MemRef2D_i8 *w2,
+                              MemRef2D_i8 *c2, MemRef2D_i32 *d2,
+                              MemRef2D_i8 *w3, MemRef2D_i8 *c3,
+                              MemRef2D_i32 *d3, MemRef2D_i8 *w4,
+                              MemRef2D_i8 *c4, MemRef2D_i32 *d4,
+                              MemRef2D_i8 *w5, MemRef2D_i8 *c5,
+                              MemRef2D_i32 *d5);
+
+static uint32_t lcg_state = 777;
+static inline elem_t next_elem(void) {
+  lcg_state = lcg_state * 1664525u + 1013904223u;
+  return (elem_t)((lcg_state >> 24) % 5) - 2;
+}
+
+static void init_random_i8(elem_t *buf, int len) {
+  for (int i = 0; i < len; ++i) {
+    buf[i] = next_elem();
+  }
+}
+
+static inline uint64_t read_cycles(void) {
+  uint64_t cycles;
+  asm volatile("rdcycle %0" : "=r"(cycles));
+  return cycles;
+}
+
+static MemRef2D_i8 make_memref_i8(elem_t *base, int64_t rows, int64_t cols) {
+  MemRef2D_i8 ref;
+  ref.basePtr = base;
+  ref.data = base;
+  ref.offset = 0;
+  ref.sizes[0] = rows;
+  ref.sizes[1] = cols;
+  ref.strides[1] = 1;
+  ref.strides[0] = cols;
+  return ref;
+}
+
+static MemRef2D_i32 make_memref_i32(acc_t *base, int64_t rows, int64_t cols) {
+  MemRef2D_i32 ref;
+  ref.basePtr = 
base; + ref.data = base; + ref.offset = 0; + ref.sizes[0] = rows; + ref.sizes[1] = cols; + ref.strides[1] = 1; + ref.strides[0] = cols; + return ref; +} + +static acc_t d0_bias[64][2560] row_align_acc(1) = {0}; +static acc_t d1_bias[64][2048] row_align_acc(1) = {0}; +static acc_t d2_bias[64][1536] row_align_acc(1) = {0}; +static acc_t d3_bias[64][1024] row_align_acc(1) = {0}; +static acc_t d4_bias[64][512] row_align_acc(1) = {0}; +static acc_t d5_bias[64][64] row_align_acc(1) = {0}; + +int main(void) { + lcg_state = 777; + init_random_i8(&input_mat[0][0], (int)(sizeof(input_mat) / sizeof(elem_t))); + init_random_i8(&weights0[0][0], (int)(sizeof(weights0) / sizeof(elem_t))); + init_random_i8(&weights1[0][0], (int)(sizeof(weights1) / sizeof(elem_t))); + init_random_i8(&weights2[0][0], (int)(sizeof(weights2) / sizeof(elem_t))); + init_random_i8(&weights3[0][0], (int)(sizeof(weights3) / sizeof(elem_t))); + init_random_i8(&weights4[0][0], (int)(sizeof(weights4) / sizeof(elem_t))); + init_random_i8(&weights5[0][0], (int)(sizeof(weights5) / sizeof(elem_t))); + + memset(inter_results0, 0, sizeof(inter_results0)); + memset(inter_results1, 0, sizeof(inter_results1)); + memset(inter_results2, 0, sizeof(inter_results2)); + memset(inter_results3, 0, sizeof(inter_results3)); + memset(inter_results4, 0, sizeof(inter_results4)); + memset(inter_results5, 0, sizeof(inter_results5)); + memset(d0_bias, 0, sizeof(d0_bias)); + memset(d1_bias, 0, sizeof(d1_bias)); + memset(d2_bias, 0, sizeof(d2_bias)); + memset(d3_bias, 0, sizeof(d3_bias)); + memset(d4_bias, 0, sizeof(d4_bias)); + memset(d5_bias, 0, sizeof(d5_bias)); + + MemRef2D_i8 a0_ref = make_memref_i8(&input_mat[0][0], 64, 832); + MemRef2D_i8 w0_ref = make_memref_i8(&weights0[0][0], 832, 2560); + MemRef2D_i8 c0_ref = make_memref_i8(&inter_results0[0][0], 64, 2560); + MemRef2D_i32 d0_ref = make_memref_i32(&d0_bias[0][0], 64, 2560); + + MemRef2D_i8 w1_ref = make_memref_i8(&weights1[0][0], 2560, 2048); + MemRef2D_i8 c1_ref = 
make_memref_i8(&inter_results1[0][0], 64, 2048); + MemRef2D_i32 d1_ref = make_memref_i32(&d1_bias[0][0], 64, 2048); + + MemRef2D_i8 w2_ref = make_memref_i8(&weights2[0][0], 2048, 1536); + MemRef2D_i8 c2_ref = make_memref_i8(&inter_results2[0][0], 64, 1536); + MemRef2D_i32 d2_ref = make_memref_i32(&d2_bias[0][0], 64, 1536); + + MemRef2D_i8 w3_ref = make_memref_i8(&weights3[0][0], 1536, 1024); + MemRef2D_i8 c3_ref = make_memref_i8(&inter_results3[0][0], 64, 1024); + MemRef2D_i32 d3_ref = make_memref_i32(&d3_bias[0][0], 64, 1024); + + MemRef2D_i8 w4_ref = make_memref_i8(&weights4[0][0], 1024, 512); + MemRef2D_i8 c4_ref = make_memref_i8(&inter_results4[0][0], 64, 512); + MemRef2D_i32 d4_ref = make_memref_i32(&d4_bias[0][0], 64, 512); + + MemRef2D_i8 w5_ref = make_memref_i8(&weights5[0][0], 512, 64); + MemRef2D_i8 c5_ref = make_memref_i8(&inter_results5[0][0], 64, 64); + MemRef2D_i32 d5_ref = make_memref_i32(&d5_bias[0][0], 64, 64); + + gemmini_flush(0); + + uint64_t start = read_cycles(); + _mlir_ciface_mlp1(&a0_ref, &w0_ref, &c0_ref, &d0_ref, + &w1_ref, &c1_ref, &d1_ref, + &w2_ref, &c2_ref, &d2_ref, + &w3_ref, &c3_ref, &d3_ref, + &w4_ref, &c4_ref, &d4_ref, + &w5_ref, &c5_ref, &d5_ref); + gemmini_fence(); + uint64_t end = read_cycles(); + + printf("Buddy mlp1 cycles: %llu\n", (unsigned long long)(end - start)); + long long checksum = 0; + for (int i = 0; i < 64; ++i) { + for (int j = 0; j < 64; ++j) { + checksum += inter_results5[i][j]; + } + } + printf("Buddy mlp1 output checksum: %lld\n", checksum); + return 0; +} diff --git a/experiments/buddy-benchmarks/kernels/mlp1/mlp1-buddy.mlir b/experiments/buddy-benchmarks/kernels/mlp1/mlp1-buddy.mlir new file mode 100644 index 0000000..a9d9fa2 --- /dev/null +++ b/experiments/buddy-benchmarks/kernels/mlp1/mlp1-buddy.mlir @@ -0,0 +1,35 @@ +module { + func.func @mlp1(%a0: memref<64x832xi8>, + %w0: memref<832x2560xi8>, + %c0: memref<64x2560xi8>, + %d0: memref<64x2560xi32>, + %w1: memref<2560x2048xi8>, + %c1: memref<64x2048xi8>, 
+ %d1: memref<64x2048xi32>, + %w2: memref<2048x1536xi8>, + %c2: memref<64x1536xi8>, + %d2: memref<64x1536xi32>, + %w3: memref<1536x1024xi8>, + %c3: memref<64x1024xi8>, + %d3: memref<64x1024xi32>, + %w4: memref<1024x512xi8>, + %c4: memref<64x512xi8>, + %d4: memref<64x512xi32>, + %w5: memref<512x64xi8>, + %c5: memref<64x64xi8>, + %d5: memref<64x64xi32>) attributes { llvm.emit_c_interface } { + gemmini.tile_matmul %a0 %w0 %c0 %d0 {dataflow = 1, act = 1} : + memref<64x832xi8> memref<832x2560xi8> memref<64x2560xi8> memref<64x2560xi32> + gemmini.tile_matmul %c0 %w1 %c1 %d1 {dataflow = 1, act = 1} : + memref<64x2560xi8> memref<2560x2048xi8> memref<64x2048xi8> memref<64x2048xi32> + gemmini.tile_matmul %c1 %w2 %c2 %d2 {dataflow = 1, act = 1} : + memref<64x2048xi8> memref<2048x1536xi8> memref<64x1536xi8> memref<64x1536xi32> + gemmini.tile_matmul %c2 %w3 %c3 %d3 {dataflow = 1, act = 1} : + memref<64x1536xi8> memref<1536x1024xi8> memref<64x1024xi8> memref<64x1024xi32> + gemmini.tile_matmul %c3 %w4 %c4 %d4 {dataflow = 1, act = 1} : + memref<64x1024xi8> memref<1024x512xi8> memref<64x512xi8> memref<64x512xi32> + gemmini.tile_matmul %c4 %w5 %c5 %d5 {dataflow = 1, act = 1} : + memref<64x512xi8> memref<512x64xi8> memref<64x64xi8> memref<64x64xi32> + return + } +} diff --git a/experiments/buddy-benchmarks/kernels/mlp2/mlp2-buddy-os.mlir b/experiments/buddy-benchmarks/kernels/mlp2/mlp2-buddy-os.mlir new file mode 100644 index 0000000..34efe2a --- /dev/null +++ b/experiments/buddy-benchmarks/kernels/mlp2/mlp2-buddy-os.mlir @@ -0,0 +1,15 @@ +module { + func.func @mlp2(%a0: memref<64x832xi8>, + %w0: memref<832x832xi8>, + %c0: memref<64x832xi8>, + %d0: memref<64x832xi32>, + %w1: memref<832x64xi8>, + %c1: memref<64x64xi8>, + %d1: memref<64x64xi32>) attributes { llvm.emit_c_interface } { + gemmini.tile_matmul %a0 %w0 %c0 %d0 {dataflow = 0, act = 1} : + memref<64x832xi8> memref<832x832xi8> memref<64x832xi8> memref<64x832xi32> + gemmini.tile_matmul %c0 %w1 %c1 %d1 {dataflow = 0, act = 1} : + 
memref<64x832xi8> memref<832x64xi8> memref<64x64xi8> memref<64x64xi32>
+    return
+  }
+}
diff --git a/experiments/buddy-benchmarks/kernels/mlp2/mlp2-buddy.c b/experiments/buddy-benchmarks/kernels/mlp2/mlp2-buddy.c
new file mode 100644
index 0000000..a1e393d
--- /dev/null
+++ b/experiments/buddy-benchmarks/kernels/mlp2/mlp2-buddy.c
@@ -0,0 +1,117 @@
+#include <stdint.h>
+#include <stdio.h>
+
+#include "include/gemmini.h"
+#include "parameters2.h"
+
+typedef struct {
+  elem_t *basePtr;
+  elem_t *data;
+  int64_t offset;
+  int64_t sizes[2];
+  int64_t strides[2];
+} MemRef2D_i8;
+
+typedef struct {
+  acc_t *basePtr;
+  acc_t *data;
+  int64_t offset;
+  int64_t sizes[2];
+  int64_t strides[2];
+} MemRef2D_i32;
+
+extern void _mlir_ciface_mlp2(MemRef2D_i8 *a0, MemRef2D_i8 *w0,
+                              MemRef2D_i8 *c0, MemRef2D_i32 *d0,
+                              MemRef2D_i8 *w1, MemRef2D_i8 *c1,
+                              MemRef2D_i32 *d1);
+
+static uint32_t lcg_state = 777;
+static inline elem_t next_elem(void) {
+  lcg_state = lcg_state * 1664525u + 1013904223u;
+  return (elem_t)((lcg_state >> 24) % 5) - 2;
+}
+
+static void init_random_i8(elem_t *buf, int len) {
+  for (int i = 0; i < len; ++i) {
+    buf[i] = next_elem();
+  }
+}
+
+static acc_t d0_bias[64][832] row_align_acc(1) = {0};
+static acc_t d1_bias[64][64] row_align_acc(1) = {0};
+
+static inline uint64_t read_cycles(void) {
+  uint64_t cycles;
+  asm volatile("rdcycle %0" : "=r"(cycles));
+  return cycles;
+}
+
+static MemRef2D_i8 make_memref_i8(elem_t *base, int64_t rows, int64_t cols) {
+  MemRef2D_i8 ref;
+  ref.basePtr = base;
+  ref.data = base;
+  ref.offset = 0;
+  ref.sizes[0] = rows;
+  ref.sizes[1] = cols;
+  ref.strides[1] = 1;
+  ref.strides[0] = cols;
+  return ref;
+}
+
+static MemRef2D_i32 make_memref_i32(acc_t *base, int64_t rows, int64_t cols) {
+  MemRef2D_i32 ref;
+  ref.basePtr = base;
+  ref.data = base;
+  ref.offset = 0;
+  ref.sizes[0] = rows;
+  ref.sizes[1] = cols;
+  ref.strides[1] = 1;
+  ref.strides[0] = cols;
+  return ref;
+}
+
+int main(void) {
+  lcg_state = 777;
+  init_random_i8(&input_mat[0][0], 
(int)(sizeof(input_mat) / sizeof(elem_t))); + init_random_i8(&weights0[0][0], (int)(sizeof(weights0) / sizeof(elem_t))); + init_random_i8(&weights1[0][0], (int)(sizeof(weights1) / sizeof(elem_t))); + + for (int i = 0; i < 64; ++i) { + for (int j = 0; j < 832; ++j) { + inter_results0[i][j] = 0; + d0_bias[i][j] = 0; + } + } + for (int i = 0; i < 64; ++i) { + for (int j = 0; j < 64; ++j) { + inter_results1[i][j] = 0; + d1_bias[i][j] = 0; + } + } + + MemRef2D_i8 a0_ref = make_memref_i8(&input_mat[0][0], 64, 832); + MemRef2D_i8 w0_ref = make_memref_i8(&weights0[0][0], 832, 832); + MemRef2D_i8 c0_ref = make_memref_i8(&inter_results0[0][0], 64, 832); + MemRef2D_i32 d0_ref = make_memref_i32(&d0_bias[0][0], 64, 832); + MemRef2D_i8 w1_ref = make_memref_i8(&weights1[0][0], 832, 64); + MemRef2D_i8 c1_ref = make_memref_i8(&inter_results1[0][0], 64, 64); + MemRef2D_i32 d1_ref = make_memref_i32(&d1_bias[0][0], 64, 64); + + gemmini_flush(0); + + uint64_t start = read_cycles(); + _mlir_ciface_mlp2(&a0_ref, &w0_ref, &c0_ref, &d0_ref, + &w1_ref, &c1_ref, &d1_ref); + gemmini_fence(); + uint64_t end = read_cycles(); + + printf("Buddy mlp2 cycles: %llu\n", (unsigned long long)(end - start)); + long long checksum = 0; + for (int i = 0; i < 64; ++i) { + for (int j = 0; j < 64; ++j) { + checksum += inter_results1[i][j]; + } + } + printf("Buddy mlp2 output checksum: %lld\n", checksum); + return 0; +} diff --git a/experiments/buddy-benchmarks/kernels/mlp2/mlp2-buddy.mlir b/experiments/buddy-benchmarks/kernels/mlp2/mlp2-buddy.mlir new file mode 100644 index 0000000..c513c73 --- /dev/null +++ b/experiments/buddy-benchmarks/kernels/mlp2/mlp2-buddy.mlir @@ -0,0 +1,15 @@ +module { + func.func @mlp2(%a0: memref<64x832xi8>, + %w0: memref<832x832xi8>, + %c0: memref<64x832xi8>, + %d0: memref<64x832xi32>, + %w1: memref<832x64xi8>, + %c1: memref<64x64xi8>, + %d1: memref<64x64xi32>) attributes { llvm.emit_c_interface } { + gemmini.tile_matmul %a0 %w0 %c0 %d0 {dataflow = 1, act = 1} : + memref<64x832xi8> 
memref<832x832xi8> memref<64x832xi8> memref<64x832xi32>
+    gemmini.tile_matmul %c0 %w1 %c1 %d1 {dataflow = 1, act = 1} :
+      memref<64x832xi8> memref<832x64xi8> memref<64x64xi8> memref<64x64xi32>
+    return
+  }
+}
diff --git a/experiments/buddy-benchmarks/kernels/softmax-matmul/softmax-matmul-buddy.c b/experiments/buddy-benchmarks/kernels/softmax-matmul/softmax-matmul-buddy.c
new file mode 100644
index 0000000..c83c179
--- /dev/null
+++ b/experiments/buddy-benchmarks/kernels/softmax-matmul/softmax-matmul-buddy.c
@@ -0,0 +1,121 @@
+#include <stdio.h>
+#include <stdlib.h>
+
+#include "include/gemmini.h"
+#include "include/gemmini_testutils.h"
+
+#define MAT_DIM_I 31
+#define MAT_DIM_K 30
+#define MAT_DIM_J 66
+
+typedef struct {
+  elem_t *basePtr;
+  elem_t *data;
+  int64_t offset;
+  int64_t sizes[2];
+  int64_t strides[2];
+} MemRef2D_i8;
+
+typedef struct {
+  acc_t *basePtr;
+  acc_t *data;
+  int64_t offset;
+  int64_t sizes[2];
+  int64_t strides[2];
+} MemRef2D_i32;
+
+extern void _mlir_ciface_softmax_matmul(MemRef2D_i8 *a, MemRef2D_i8 *b,
+                                        MemRef2D_i8 *c, MemRef2D_i32 *d);
+
+static MemRef2D_i8 make_memref_i8(elem_t *base, int64_t rows, int64_t cols) {
+  MemRef2D_i8 ref;
+  ref.basePtr = base;
+  ref.data = base;
+  ref.offset = 0;
+  ref.sizes[0] = rows;
+  ref.sizes[1] = cols;
+  ref.strides[1] = 1;
+  ref.strides[0] = cols;
+  return ref;
+}
+
+static MemRef2D_i32 make_memref_i32(acc_t *base, int64_t rows, int64_t cols) {
+  MemRef2D_i32 ref;
+  ref.basePtr = base;
+  ref.data = base;
+  ref.offset = 0;
+  ref.sizes[0] = rows;
+  ref.sizes[1] = cols;
+  ref.strides[1] = 1;
+  ref.strides[0] = cols;
+  return ref;
+}
+
+int main(void) {
+  static elem_t full_A[MAT_DIM_I][MAT_DIM_K] row_align(1);
+  static elem_t full_B[MAT_DIM_K][MAT_DIM_J] row_align(1);
+  static elem_t full_C[MAT_DIM_I][MAT_DIM_J] row_align(1);
+  static acc_t full_D[MAT_DIM_I][MAT_DIM_J] row_align_acc(1);
+
+  for (size_t i = 0; i < MAT_DIM_I; ++i) {
+    for (size_t j = 0; j < MAT_DIM_K; ++j) {
+      full_A[i][j] = (rand() % 7) - 3;
+    }
+  }
+
+  
for (size_t i = 0; i < MAT_DIM_K; ++i) { + for (size_t j = 0; j < MAT_DIM_J; ++j) { + full_B[i][j] = (rand() % 7) - 3; + } + } + + for (size_t i = 0; i < MAT_DIM_I; ++i) { + for (size_t j = 0; j < MAT_DIM_J; ++j) { + full_D[i][j] = 0; + } + } + + long long a_checksum = 0; + elem_t *a_ptr = &full_A[0][0]; + int a_elems = MAT_DIM_I * MAT_DIM_K; + for (int i = 0; i < a_elems; ++i) { + a_checksum += a_ptr[i]; + } + long long b_checksum = 0; + elem_t *b_ptr = &full_B[0][0]; + int b_elems = MAT_DIM_K * MAT_DIM_J; + for (int i = 0; i < b_elems; ++i) { + b_checksum += b_ptr[i]; + } + long long d_checksum = 0; + acc_t *d_ptr = &full_D[0][0]; + int d_elems = MAT_DIM_I * MAT_DIM_J; + for (int i = 0; i < d_elems; ++i) { + d_checksum += d_ptr[i]; + } + printf("A checksum: %lld\n", a_checksum); + printf("B checksum: %lld\n", b_checksum); + printf("D checksum: %lld\n", d_checksum); + + MemRef2D_i8 a_ref = make_memref_i8(&full_A[0][0], MAT_DIM_I, MAT_DIM_K); + MemRef2D_i8 b_ref = make_memref_i8(&full_B[0][0], MAT_DIM_K, MAT_DIM_J); + MemRef2D_i8 c_ref = make_memref_i8(&full_C[0][0], MAT_DIM_I, MAT_DIM_J); + MemRef2D_i32 d_ref = make_memref_i32(&full_D[0][0], MAT_DIM_I, MAT_DIM_J); + + gemmini_flush(0); + uint64_t start = read_cycles(); + _mlir_ciface_softmax_matmul(&a_ref, &b_ref, &c_ref, &d_ref); + gemmini_fence(); + uint64_t end = read_cycles(); + + printf("Buddy softmax matmul cycles: %llu\n", + (unsigned long long)(end - start)); + long long c_checksum = 0; + elem_t *c_ptr = &full_C[0][0]; + int c_elems = MAT_DIM_I * MAT_DIM_J; + for (int i = 0; i < c_elems; ++i) { + c_checksum += c_ptr[i]; + } + printf("Buddy output checksum: %lld\n", c_checksum); + return 0; +} diff --git a/experiments/buddy-benchmarks/kernels/softmax-matmul/softmax-matmul-buddy.mlir b/experiments/buddy-benchmarks/kernels/softmax-matmul/softmax-matmul-buddy.mlir new file mode 100644 index 0000000..d204086 --- /dev/null +++ b/experiments/buddy-benchmarks/kernels/softmax-matmul/softmax-matmul-buddy.mlir @@ 
-0,0 +1,10 @@ +module { + func.func @softmax_matmul(%a: memref<31x30xi8>, + %b: memref<30x66xi8>, + %c: memref<31x66xi8>, + %d: memref<31x66xi32>) attributes { llvm.emit_c_interface } { + gemmini.tile_matmul %a %b %c %d {act = 4, bertScale = 0.05:f32, dataflow = 1} : + memref<31x30xi8> memref<30x66xi8> memref<31x66xi8> memref<31x66xi32> + return + } +} diff --git a/experiments/buddy-benchmarks/logs/conv1-bad-buddy.log b/experiments/buddy-benchmarks/logs/conv1-bad-buddy.log new file mode 100644 index 0000000..a0d6247 --- /dev/null +++ b/experiments/buddy-benchmarks/logs/conv1-bad-buddy.log @@ -0,0 +1,9 @@ +=== ResNet50 Conv1 - BAD Buddy MLIR (INTENTIONAL WRONG STRIDE) === +This should produce WRONG checksum to verify our test methodology + +BAD Buddy conv1 cycles: 7082 +Output checksum: 89685778 +(This should NOT match the Gemmini C reference!) +=== BAD Conv1 DONE === +Gemmini extension configured with: + dim = 16 diff --git a/experiments/buddy-benchmarks/logs/conv1-buddy.log b/experiments/buddy-benchmarks/logs/conv1-buddy.log new file mode 100644 index 0000000..89a41db --- /dev/null +++ b/experiments/buddy-benchmarks/logs/conv1-buddy.log @@ -0,0 +1,14 @@ +=== ResNet50 Conv1 - Buddy MLIR === +Input: 4 x 224 x 224 x 3 +Kernel: 7 x 7, stride=2, padding=3 +Output (after pool): 4 x 56 x 56 x 64 +Input checksum: 3461497 +Weight checksum: -199 +Bias checksum: 110400 +Buddy conv1 cycles: 7313 +Output checksum: 10206332 +Output elements: 802816 +First 10 output values: 11 21 0 28 26 31 8 12 27 10 +=== Conv1 Buddy MLIR DONE === +Gemmini extension configured with: + dim = 16 diff --git a/experiments/buddy-benchmarks/logs/conv1-gemmini.log b/experiments/buddy-benchmarks/logs/conv1-gemmini.log new file mode 100644 index 0000000..bd2450a --- /dev/null +++ b/experiments/buddy-benchmarks/logs/conv1-gemmini.log @@ -0,0 +1,16 @@ +=== ResNet50 Conv1 - Gemmini C Reference === +Input: 4 x 224 x 224 x 3 +Kernel: 7 x 7, stride=2, padding=3 +Output (before pool): 4 x 112 x 112 x 64 +Pool: 
3 x 3, stride=2, padding=1 +Output (after pool): 4 x 56 x 56 x 64 +Input checksum: 3461497 +Weight checksum: -199 +Bias checksum: 110400 +Conv1 cycles: 225146 +Output checksum: 10206332 +Output elements: 802816 +First 10 output values: 11 21 0 28 26 31 8 12 27 10 +=== Conv1 Gemmini C Reference PASS === +Gemmini extension configured with: + dim = 16 diff --git a/experiments/buddy-benchmarks/resnet50/Makefile b/experiments/buddy-benchmarks/resnet50/Makefile new file mode 100644 index 0000000..00bff2c --- /dev/null +++ b/experiments/buddy-benchmarks/resnet50/Makefile @@ -0,0 +1,241 @@ +# Makefile for ResNet50 Gemmini vs Buddy-MLIR comparison +# +# Targets: +# conv1-gemmini-baremetal - Gemmini C reference (single layer) +# conv1-buddy-baremetal - Buddy MLIR (single layer) +# run-gemmini - Run Gemmini C on Spike +# run-buddy - Run Buddy on Spike +# compare - Run both and compare checksums + +# ============== Paths ============== +RISCV ?= /home/eecs/ashvin.verma/toolchains/riscv +BUDDY ?= /scratch/ashvin/buddy-mlir/build/bin +PK ?= /scratch/ashvin/riscv-pk/build/pk +SPIKE ?= $(RISCV)/bin/spike + +GEMMINI_ROOT := /scratch/ashvin/chipyard/generators/gemmini/software/gemmini-rocc-tests +BENCH_COMMON := $(GEMMINI_ROOT)/riscv-tests/benchmarks/common +GEMMINI_INCLUDE := $(GEMMINI_ROOT)/include +IMAGENET_DIR := $(GEMMINI_ROOT)/imagenet + +# ============== Compilers ============== +CC := $(RISCV)/bin/riscv64-unknown-elf-gcc + +# ============== Flags ============== +CFLAGS := \ + -DPREALLOCATE=1 \ + -DMULTITHREAD=1 \ + -mcmodel=medany \ + -std=gnu99 \ + -O2 \ + -ffast-math \ + -fno-common \ + -fno-builtin-printf \ + -fno-tree-loop-distribute-patterns \ + -march=rv64gc -Wa,-march=rv64gc \ + -I$(GEMMINI_ROOT)/riscv-tests \ + -I$(GEMMINI_ROOT)/riscv-tests/env \ + -I$(GEMMINI_ROOT) \ + -I$(BENCH_COMMON) \ + -I$(GEMMINI_INCLUDE) \ + -I$(IMAGENET_DIR) \ + -Wno-incompatible-pointer-types + +CFLAGS_BAREMETAL := \ + $(CFLAGS) \ + -nostdlib \ + -nostartfiles \ + -static \ + -T 
$(BENCH_COMMON)/test.ld \ + -DBAREMETAL=1 + +CFLAGS_PK := \ + $(CFLAGS) \ + -static \ + -DBAREMETAL=1 + +LIBS := -lm -lgcc + +# Benchmark common sources +BENCH_SRCS := $(wildcard $(BENCH_COMMON)/*.c) $(wildcard $(BENCH_COMMON)/*.S) + +# ============== Buddy MLIR passes ============== +BUDDY_OPT_FLAGS := \ + -lower-gemmini \ + -convert-scf-to-cf \ + -convert-arith-to-llvm \ + -convert-func-to-llvm \ + -llvm-legalize-for-export + +BUDDY_LLC_FLAGS := \ + -O3 \ + -filetype=obj \ + -mtriple=riscv64-unknown-elf \ + -mattr=+buddyext,+d,+f,+c \ + -float-abi=hard \ + -code-model=medium + +# ============== Targets ============== +.PHONY: all clean run-gemmini run-buddy run-bad compare validate conv2-validate + +all: conv1-gemmini-baremetal conv1-buddy-baremetal conv1-bad-buddy-baremetal + +conv2: conv2-gemmini-baremetal conv2-buddy-baremetal + +# ---- Gemmini C Reference ---- +conv1-gemmini-baremetal: conv1-gemmini.c + $(CC) $(CFLAGS_BAREMETAL) $< $(BENCH_SRCS) $(LIBS) -o $@ + +conv1-gemmini-pk: conv1-gemmini.c + $(CC) $(CFLAGS_PK) $< $(LIBS) -o $@ + +# ---- Buddy MLIR Path ---- +# Step 1: Lower MLIR to LLVM dialect +conv1-buddy.llvm.mlir: conv1-buddy.mlir + $(BUDDY)/buddy-opt $< $(BUDDY_OPT_FLAGS) -o $@ + +# Step 2: Translate to LLVM IR +conv1-buddy.ll: conv1-buddy.llvm.mlir + $(BUDDY)/buddy-translate $< --buddy-to-llvmir -o $@ + +# Step 3: Compile to object file +conv1-buddy.o: conv1-buddy.ll + $(BUDDY)/buddy-llc $(BUDDY_LLC_FLAGS) $< -o $@ + +# Step 4: Link with C harness (baremetal) +conv1-buddy-baremetal: conv1-buddy.c conv1-buddy.o + $(CC) $(CFLAGS_BAREMETAL) $< conv1-buddy.o $(BENCH_SRCS) $(LIBS) -o $@ + +# Step 4 (alternate): Link with C harness (pk) +conv1-buddy-pk: conv1-buddy.c conv1-buddy.o + $(CC) $(CFLAGS_PK) $< conv1-buddy.o $(LIBS) -o $@ + +# ---- BAD Buddy MLIR Path (intentionally wrong for validation) ---- +conv1-bad-buddy.llvm.mlir: conv1-bad-buddy.mlir + $(BUDDY)/buddy-opt $< $(BUDDY_OPT_FLAGS) -o $@ + +conv1-bad-buddy.ll: conv1-bad-buddy.llvm.mlir + 
$(BUDDY)/buddy-translate $< --buddy-to-llvmir -o $@ + +conv1-bad-buddy.o: conv1-bad-buddy.ll + $(BUDDY)/buddy-llc $(BUDDY_LLC_FLAGS) $< -o $@ + +conv1-bad-buddy-baremetal: conv1-bad-buddy.c conv1-bad-buddy.o + $(CC) $(CFLAGS_BAREMETAL) $< conv1-bad-buddy.o $(BENCH_SRCS) $(LIBS) -o $@ + +# ---- Conv2 (1x1 matmul) ---- +conv2-gemmini-baremetal: conv2-gemmini.c + $(CC) $(CFLAGS_BAREMETAL) $< $(BENCH_SRCS) $(LIBS) -o $@ + +conv2-buddy.llvm.mlir: conv2-buddy.mlir + $(BUDDY)/buddy-opt $< $(BUDDY_OPT_FLAGS) -o $@ + +conv2-buddy.ll: conv2-buddy.llvm.mlir + $(BUDDY)/buddy-translate $< --buddy-to-llvmir -o $@ + +conv2-buddy.o: conv2-buddy.ll + $(BUDDY)/buddy-llc $(BUDDY_LLC_FLAGS) $< -o $@ + +conv2-buddy-baremetal: conv2-buddy.c conv2-buddy.o + $(CC) $(CFLAGS_BAREMETAL) $< conv2-buddy.o $(BENCH_SRCS) $(LIBS) -o $@ + +# ============== Run targets ============== +run-gemmini: conv1-gemmini-baremetal + $(SPIKE) --extension=gemmini $< + +run-gemmini-pk: conv1-gemmini-pk + $(SPIKE) --extension=gemmini $(PK) $< + +run-buddy: conv1-buddy-baremetal + $(SPIKE) --extension=gemmini $< + +run-buddy-pk: conv1-buddy-pk + $(SPIKE) --extension=gemmini $(PK) $< + +run-bad: conv1-bad-buddy-baremetal + $(SPIKE) --extension=gemmini $< + +run-conv2-gemmini: conv2-gemmini-baremetal + $(SPIKE) --extension=gemmini $< + +run-conv2-buddy: conv2-buddy-baremetal + $(SPIKE) --extension=gemmini $< + +conv2-validate: conv2-gemmini-baremetal conv2-buddy-baremetal + @echo "========================================" + @echo " Conv2 (1x1 matmul) Validation " + @echo "========================================" + @echo "" + @echo "--- Gemmini C Reference ---" + @$(SPIKE) --extension=gemmini conv2-gemmini-baremetal 2>&1 | tee conv2-gemmini.log + @echo "" + @echo "--- Buddy MLIR ---" + @$(SPIKE) --extension=gemmini conv2-buddy-baremetal 2>&1 | tee conv2-buddy.log + @echo "" + @echo "=== Conv2 Comparison ===" + @GEMMINI_CKSUM=$$(grep 'Conv2 output checksum:' conv2-gemmini.log | awk '{print $$4}'); \ + 
+	BUDDY_CKSUM=$$(grep 'Conv2 output checksum:' conv2-buddy.log | awk '{print $$4}'); \
+	echo "Gemmini C checksum: $$GEMMINI_CKSUM"; \
+	echo "Buddy checksum: $$BUDDY_CKSUM"; \
+	if [ "$$GEMMINI_CKSUM" = "$$BUDDY_CKSUM" ]; then \
+	echo "[PASS] Conv2 checksums match"; \
+	else \
+	echo "[FAIL] Conv2 checksums do NOT match!"; \
+	fi
+
+compare: conv1-gemmini-baremetal conv1-buddy-baremetal
+	@echo "=== Running Gemmini C Reference ==="
+	@$(SPIKE) --extension=gemmini conv1-gemmini-baremetal 2>&1 | tee gemmini.log
+	@echo ""
+	@echo "=== Running Buddy MLIR ==="
+	@$(SPIKE) --extension=gemmini conv1-buddy-baremetal 2>&1 | tee buddy.log
+	@echo ""
+	@echo "=== Comparison ==="
+	@echo "Gemmini output checksum: $$(grep 'Output checksum' gemmini.log)"
+	@echo "Buddy output checksum: $$(grep 'Output checksum' buddy.log)"
+
+# Full validation including intentional failure case
+validate: conv1-gemmini-baremetal conv1-buddy-baremetal conv1-bad-buddy-baremetal
+	@echo "========================================"
+	@echo " Conv1 Validation Test Suite "
+	@echo "========================================"
+	@echo ""
+	@echo "--- Test 1: Gemmini C Reference ---"
+	@$(SPIKE) --extension=gemmini conv1-gemmini-baremetal 2>&1 | tee gemmini.log
+	@GEMMINI_CKSUM=$$(grep 'Output checksum:' gemmini.log | awk '{print $$3}'); \
+	echo "Reference checksum: $$GEMMINI_CKSUM" > validation_result.txt
+	@echo ""
+	@echo "--- Test 2: Buddy MLIR (correct) ---"
+	@$(SPIKE) --extension=gemmini conv1-buddy-baremetal 2>&1 | tee buddy.log
+	@echo ""
+	@echo "--- Test 3: Buddy MLIR (INTENTIONAL BAD - wrong stride) ---"
+	@$(SPIKE) --extension=gemmini conv1-bad-buddy-baremetal 2>&1 | tee bad.log
+	@echo ""
+	@echo "========================================"
+	@echo " VALIDATION RESULTS "
+	@echo "========================================"
+	@GEMMINI_CKSUM=$$(grep 'Output checksum:' gemmini.log | awk '{print $$3}'); \
+	BUDDY_CKSUM=$$(grep 'Output checksum:' buddy.log | awk '{print $$3}'); \
+	BAD_CKSUM=$$(grep 'Output checksum:' bad.log | awk '{print $$3}'); \
+	echo "Gemmini C reference checksum: $$GEMMINI_CKSUM"; \
+	echo "Buddy MLIR checksum: $$BUDDY_CKSUM"; \
+	echo "BAD Buddy checksum: $$BAD_CKSUM"; \
+	echo ""; \
+	if [ "$$GEMMINI_CKSUM" = "$$BUDDY_CKSUM" ]; then \
+	echo "[PASS] Buddy MLIR matches Gemmini C reference"; \
+	else \
+	echo "[FAIL] Buddy MLIR does NOT match Gemmini C reference!"; \
+	fi; \
+	if [ "$$GEMMINI_CKSUM" != "$$BAD_CKSUM" ]; then \
+	echo "[PASS] BAD test correctly produces different checksum (validation works)"; \
+	else \
+	echo "[FAIL] BAD test unexpectedly matches reference (validation broken!)"; \
+	fi
+
+# ============== Clean ==============
+clean:
+	rm -f *.o *.ll *.llvm.mlir *.log validation_result.txt
+	rm -f conv1-gemmini-baremetal conv1-gemmini-pk
+	rm -f conv1-buddy-baremetal conv1-buddy-pk
+	rm -f conv1-bad-buddy-baremetal
+	rm -f conv2-gemmini-baremetal conv2-buddy-baremetal
diff --git a/experiments/buddy-benchmarks/resnet50/conv1-bad-buddy.c b/experiments/buddy-benchmarks/resnet50/conv1-bad-buddy.c
new file mode 100644
index 0000000..2999c8a
--- /dev/null
+++ b/experiments/buddy-benchmarks/resnet50/conv1-bad-buddy.c
@@ -0,0 +1,135 @@
+// conv1-bad-buddy.c - C harness for INTENTIONALLY WRONG Buddy-MLIR conv1
+//
+// This tests a version with a wrong stride to verify that the checksum
+// validation can actually detect failures.
+
+#include <stdio.h>
+#include <string.h>
+#include <stdint.h>
+#include <stddef.h>
+
+#include "include/gemmini.h"
+#include "include/gemmini_nn.h"
+
+#include "resnet50_params.h"
+#include "images.h"
+
+typedef struct {
+  elem_t *basePtr;
+  elem_t *data;
+  int64_t offset;
+  int64_t sizes[4];
+  int64_t strides[4];
+} MemRef4D_i8;
+
+typedef struct {
+  elem_t *basePtr;
+  elem_t *data;
+  int64_t offset;
+  int64_t sizes[2];
+  int64_t strides[2];
+} MemRef2D_i8;
+
+typedef struct {
+  acc_t *basePtr;
+  acc_t *data;
+  int64_t offset;
+  int64_t sizes[1];
+  int64_t strides[1];
+} MemRef1D_i32;
+
+// External MLIR-compiled function (BAD version with wrong stride)
+extern void _mlir_ciface_conv1_bad(MemRef4D_i8 *input, MemRef2D_i8 *weights,
+                                   MemRef1D_i32 *bias, MemRef2D_i8 *output);
+
+static MemRef4D_i8 make_memref4_i8(elem_t *base, int64_t d0, int64_t d1,
+                                   int64_t d2, int64_t d3) {
+  MemRef4D_i8 ref;
+  ref.basePtr = base;
+  ref.data = base;
+  ref.offset = 0;
+  ref.sizes[0] = d0;
+  ref.sizes[1] = d1;
+  ref.sizes[2] = d2;
+  ref.sizes[3] = d3;
+  ref.strides[3] = 1;
+  ref.strides[2] = d3;
+  ref.strides[1] = d2 * d3;
+  ref.strides[0] = d1 * d2 * d3;
+  return ref;
+}
+
+static MemRef2D_i8 make_memref2_i8(elem_t *base, int64_t rows, int64_t cols) {
+  MemRef2D_i8 ref;
+  ref.basePtr = base;
+  ref.data = base;
+  ref.offset = 0;
+  ref.sizes[0] = rows;
+  ref.sizes[1] = cols;
+  ref.strides[1] = 1;
+  ref.strides[0] = cols;
+  return ref;
+}
+
+static MemRef1D_i32 make_memref1_i32(acc_t *base, int64_t len) {
+  MemRef1D_i32 ref;
+  ref.basePtr = base;
+  ref.data = base;
+  ref.offset = 0;
+  ref.sizes[0] = len;
+  ref.strides[0] = 1;
+  return ref;
+}
+
+#define POOL_OUT_ROW_DIM 56
+#define POOL_OUT_COL_DIM 56
+#define BATCH_SIZE 4
+#define OUT_CHANNELS 64
+#define PATCH_SIZE 147
+
+static elem_t buddy_output[BATCH_SIZE * POOL_OUT_ROW_DIM * POOL_OUT_COL_DIM][OUT_CHANNELS];
+
+int main(int argc, char *argv[]) {
+  gemmini_flush(0);
+
+  printf("=== ResNet50 Conv1 - BAD Buddy MLIR (INTENTIONAL WRONG STRIDE) ===\n");
+  printf("This should produce a WRONG checksum to verify our test methodology\n\n");
+
+  memset(buddy_output, 0, sizeof(buddy_output));
+
+  MemRef4D_i8 input_ref = make_memref4_i8(
+      (elem_t*)&images[0][0][0][0],
+      BATCH_SIZE, 224, 224, 3);
+
+  MemRef2D_i8 weights_ref = make_memref2_i8(
+      (elem_t*)&conv_1_w[0][0],
+      PATCH_SIZE, OUT_CHANNELS);
+
+  MemRef1D_i32 bias_ref = make_memref1_i32(
+      (acc_t*)&conv_1_b[0],
+      OUT_CHANNELS);
+
+  MemRef2D_i8 output_ref = make_memref2_i8(
+      &buddy_output[0][0],
+      BATCH_SIZE * POOL_OUT_ROW_DIM * POOL_OUT_COL_DIM,
+      OUT_CHANNELS);
+
+  uint64_t start = read_cycles();
+  _mlir_ciface_conv1_bad(&input_ref, &weights_ref, &bias_ref, &output_ref);
+  gemmini_fence();
+  uint64_t end = read_cycles();
+
+  printf("BAD Buddy conv1 cycles: %llu\n", (unsigned long long)(end - start));
+
+  long long output_checksum = 0;
+  int output_elems = BATCH_SIZE * POOL_OUT_ROW_DIM * POOL_OUT_COL_DIM * OUT_CHANNELS;
+  const elem_t *output_ptr = &buddy_output[0][0];
+  for (int i = 0; i < output_elems; i++) {
+    output_checksum += output_ptr[i];
+  }
+  printf("Output checksum: %lld\n", output_checksum);
+  printf("(This should NOT match the Gemmini C reference!)\n");
+
+  printf("=== BAD Conv1 DONE ===\n");
+
+  return 0;
+}
diff --git a/experiments/buddy-benchmarks/resnet50/conv1-bad-buddy.mlir b/experiments/buddy-benchmarks/resnet50/conv1-bad-buddy.mlir
new file mode 100644
index 0000000..dbd6121
--- /dev/null
+++ b/experiments/buddy-benchmarks/resnet50/conv1-bad-buddy.mlir
@@ -0,0 +1,25 @@
+// conv1-bad-buddy.mlir - INTENTIONALLY WRONG to validate checksum testing
+//
+// This uses WRONG parameters (stride=1 instead of stride=2) to verify
+// that our checksum comparison can detect failures.
+
+module {
+  func.func @conv1_bad(%input: memref<4x224x224x3xi8>,
+                       %weights: memref<147x64xi8>,
+                       %bias: memref<64xi32>,
+                       %output: memref<12544x64xi8>)
+      attributes { llvm.emit_c_interface } {
+    // WRONG: Using stride=1 instead of correct stride=2
+    // This should produce a completely different (wrong) output
+    %c112 = arith.constant 112 : i64
+    %c7 = arith.constant 7 : i64
+
+    // INTENTIONAL BUG: stride=1 (should be 2)
+    gemmini.tile_conv %input %weights %bias %output %c112 %c112 %c7
+        {stride = 1, inputDilation = 1, kernelDilation = 1, padding = 3,
+         act = 1, poolSize = 3, poolStride = 2, poolPadding = 1} :
+        memref<4x224x224x3xi8> memref<147x64xi8> memref<64xi32> memref<12544x64xi8>
+        i64 i64 i64
+    return
+  }
+}
diff --git a/experiments/buddy-benchmarks/resnet50/conv1-buddy.c b/experiments/buddy-benchmarks/resnet50/conv1-buddy.c
new file mode 100644
index 0000000..ab32b80
--- /dev/null
+++ b/experiments/buddy-benchmarks/resnet50/conv1-buddy.c
@@ -0,0 +1,190 @@
+// conv1-buddy.c - C harness for Buddy-MLIR ResNet50 conv_1 layer
+//
+// This harness:
+// 1. Includes the same resnet50_params.h weights as the Gemmini C reference
+// 2. Calls the Buddy-compiled conv1 function
+// 3. Computes checksums for validation against the Gemmini C reference
+
+#include <stdio.h>
+#include <string.h>
+#include <stdint.h>
+#include <stddef.h>
+
+#include "include/gemmini.h"
+#include "include/gemmini_nn.h"
+
+// Include the actual ResNet50 parameters (same weights as Gemmini C reference)
+#include "resnet50_params.h"
+#include "images.h"
+
+// Memref descriptor types for the MLIR C interface
+typedef struct {
+  elem_t *basePtr;
+  elem_t *data;
+  int64_t offset;
+  int64_t sizes[4];
+  int64_t strides[4];
+} MemRef4D_i8;
+
+typedef struct {
+  elem_t *basePtr;
+  elem_t *data;
+  int64_t offset;
+  int64_t sizes[2];
+  int64_t strides[2];
+} MemRef2D_i8;
+
+typedef struct {
+  acc_t *basePtr;
+  acc_t *data;
+  int64_t offset;
+  int64_t sizes[1];
+  int64_t strides[1];
+} MemRef1D_i32;
+
+// External MLIR-compiled function
+extern void _mlir_ciface_conv1(MemRef4D_i8 *input, MemRef2D_i8 *weights,
+                               MemRef1D_i32 *bias, MemRef2D_i8 *output);
+
+static MemRef4D_i8 make_memref4_i8(elem_t *base, int64_t d0, int64_t d1,
+                                   int64_t d2, int64_t d3) {
+  MemRef4D_i8 ref;
+  ref.basePtr = base;
+  ref.data = base;
+  ref.offset = 0;
+  ref.sizes[0] = d0;
+  ref.sizes[1] = d1;
+  ref.sizes[2] = d2;
+  ref.sizes[3] = d3;
+  ref.strides[3] = 1;
+  ref.strides[2] = d3;
+  ref.strides[1] = d2 * d3;
+  ref.strides[0] = d1 * d2 * d3;
+  return ref;
+}
+
+static MemRef2D_i8 make_memref2_i8(elem_t *base, int64_t rows, int64_t cols) {
+  MemRef2D_i8 ref;
+  ref.basePtr = base;
+  ref.data = base;
+  ref.offset = 0;
+  ref.sizes[0] = rows;
+  ref.sizes[1] = cols;
+  ref.strides[1] = 1;
+  ref.strides[0] = cols;
+  return ref;
+}
+
+static MemRef1D_i32 make_memref1_i32(acc_t *base, int64_t len) {
+  MemRef1D_i32 ref;
+  ref.basePtr = base;
+  ref.data = base;
+  ref.offset = 0;
+  ref.sizes[0] = len;
+  ref.strides[0] = 1;
+  return ref;
+}
+
+// Output buffer - must be static to avoid stack overflow
+// Shape: [batch * pool_out_row * pool_out_col][out_channels] = [12544][64]
+#define POOL_OUT_ROW_DIM 56
+#define POOL_OUT_COL_DIM 56
+#define BATCH_SIZE 4
+#define OUT_CHANNELS 64
+#define PATCH_SIZE 147 // 7*7*3
+
+static elem_t buddy_output[BATCH_SIZE * POOL_OUT_ROW_DIM * POOL_OUT_COL_DIM][OUT_CHANNELS];
+
+int main(int argc, char *argv[]) {
+  gemmini_flush(0);
+
+  printf("=== ResNet50 Conv1 - Buddy MLIR ===\n");
+  printf("Input: %d x %d x %d x %d\n",
+         conv_1_params.batch_size,
+         conv_1_params.in_row_dim,
+         conv_1_params.in_col_dim,
+         conv_1_params.in_channels);
+  printf("Kernel: %d x %d, stride=%d, padding=%d\n",
+         conv_1_params.kernel_size, conv_1_params.kernel_size,
+         conv_1_params.stride, conv_1_params.padding);
+  printf("Output (after pool): %d x %d x %d x %d\n",
+         conv_1_params.batch_size,
+         conv_1_params.out_dim_pooled, conv_1_params.out_dim_pooled,
+         conv_1_params.out_channels);
+
+  // Compute input checksum for verification
+  long long input_checksum = 0;
+  const elem_t *input_ptr = &images[0][0][0][0];
+  int input_elems = conv_1_params.batch_size * conv_1_params.in_row_dim *
+                    conv_1_params.in_col_dim * conv_1_params.in_channels;
+  for (int i = 0; i < input_elems; i++) {
+    input_checksum += input_ptr[i];
+  }
+  printf("Input checksum: %lld\n", input_checksum);
+
+  // Compute weight checksum
+  long long weight_checksum = 0;
+  const elem_t *weight_ptr = &conv_1_w[0][0];
+  int weight_elems = conv_1_params.patch_size * conv_1_params.out_channels;
+  for (int i = 0; i < weight_elems; i++) {
+    weight_checksum += weight_ptr[i];
+  }
+  printf("Weight checksum: %lld\n", weight_checksum);
+
+  // Compute bias checksum
+  long long bias_checksum = 0;
+  for (int i = 0; i < conv_1_params.out_channels; i++) {
+    bias_checksum += conv_1_b[i];
+  }
+  printf("Bias checksum: %lld\n", bias_checksum);
+
+  // Zero output buffer
+  memset(buddy_output, 0, sizeof(buddy_output));
+
+  // Create memref descriptors
+  MemRef4D_i8 input_ref = make_memref4_i8(
+      (elem_t*)&images[0][0][0][0],
+      BATCH_SIZE, 224, 224, 3);
+
+  MemRef2D_i8 weights_ref = make_memref2_i8(
+      (elem_t*)&conv_1_w[0][0],
+      PATCH_SIZE, OUT_CHANNELS);
+
+  MemRef1D_i32 bias_ref = make_memref1_i32(
+      (acc_t*)&conv_1_b[0],
+      OUT_CHANNELS);
+
+  MemRef2D_i8 output_ref = make_memref2_i8(
+      &buddy_output[0][0],
+      BATCH_SIZE * POOL_OUT_ROW_DIM * POOL_OUT_COL_DIM,
+      OUT_CHANNELS);
+
+  // Call Buddy-compiled conv1
+  uint64_t start = read_cycles();
+  _mlir_ciface_conv1(&input_ref, &weights_ref, &bias_ref, &output_ref);
+  gemmini_fence();
+  uint64_t end = read_cycles();
+
+  printf("Buddy conv1 cycles: %llu\n", (unsigned long long)(end - start));
+
+  // Compute output checksum
+  long long output_checksum = 0;
+  int output_elems = BATCH_SIZE * POOL_OUT_ROW_DIM * POOL_OUT_COL_DIM * OUT_CHANNELS;
+  const elem_t *output_ptr = &buddy_output[0][0];
+  for (int i = 0; i < output_elems; i++) {
+    output_checksum += output_ptr[i];
+  }
+  printf("Output checksum: %lld\n", output_checksum);
+  printf("Output elements: %d\n", output_elems);
+
+  // Print a few output values for debugging
+  printf("First 10 output values: ");
+  for (int i = 0; i < 10; i++) {
+    printf("%d ", output_ptr[i]);
+  }
+  printf("\n");
+
+  printf("=== Conv1 Buddy MLIR DONE ===\n");
+
+  return 0;
+}
diff --git a/experiments/buddy-benchmarks/resnet50/conv1-buddy.mlir b/experiments/buddy-benchmarks/resnet50/conv1-buddy.mlir
new file mode 100644
index 0000000..7f84ce4
--- /dev/null
+++ b/experiments/buddy-benchmarks/resnet50/conv1-buddy.mlir
@@ -0,0 +1,31 @@
+// conv1-buddy.mlir - Buddy MLIR for ResNet50 conv_1 layer
+//
+// Conv1 params: 7x7 conv, stride=2, padding=3, with 3x3 maxpool
+// Input: 4 x 224 x 224 x 3 (batch x height x width x channels)
+// Weights: 147 x 64 (patch_size=7*7*3 x out_channels)
+// Bias: 64
+// Output: 12544 x 64 (batch*pool_out_row*pool_out_col x out_channels)
+//       = 4*56*56 x 64
+
+module {
+  func.func @conv1(%input: memref<4x224x224x3xi8>,
+                   %weights: memref<147x64xi8>,
+                   %bias: memref<64xi32>,
+                   %output: memref<12544x64xi8>)
+      attributes { llvm.emit_c_interface } {
+    // out_row_dim and out_col_dim are BEFORE pooling
+    %c112 = arith.constant 112 : i64
+    %c7 = arith.constant 7 : i64
+
+    // gemmini.tile_conv: input weights bias output outRowDim outColDim kernelDim
+    // Attributes: stride, padding, act (1=ReLU), poolSize, poolStride, poolPadding
+    // scale = 1.0 / (1 << 8) = 0.00390625 (from conv_1_params.output_scale)
+    gemmini.tile_conv %input %weights %bias %output %c112 %c112 %c7
+        {stride = 2, inputDilation = 1, kernelDilation = 1, padding = 3,
+         act = 1, poolSize = 3, poolStride = 2, poolPadding = 1,
+         scale = 0.00390625 : f32} :
+        memref<4x224x224x3xi8> memref<147x64xi8> memref<64xi32> memref<12544x64xi8>
+        i64 i64 i64
+    return
+  }
+}
diff --git a/experiments/buddy-benchmarks/resnet50/conv1-gemmini.c b/experiments/buddy-benchmarks/resnet50/conv1-gemmini.c
new file mode 100644
index 0000000..03878a4
--- /dev/null
+++ b/experiments/buddy-benchmarks/resnet50/conv1-gemmini.c
@@ -0,0 +1,126 @@
+// conv1-gemmini.c - Standalone Gemmini C test for ResNet50 conv_1 layer
+// This creates a reference checksum for validation against Buddy-MLIR
+//
+// Conv1 params: 7x7 conv, stride=2, padding=3, with 3x3 maxpool
+// Input: 4 x 224 x 224 x 3 (batch x height x width x channels)
+// Output: 4 x 56 x 56 x 64 (after conv + pool)
+
+#include <stdio.h>
+#include <string.h>
+#include <stdint.h>
+#include <stdbool.h>
+
+#include "include/gemmini.h"
+#include "include/gemmini_nn.h"
+
+// Include the actual ResNet50 parameters (contains conv_1_w, conv_1_b, conv_1_params)
+#include "resnet50_params.h"
+#include "images.h"
+
+int main(int argc, char *argv[]) {
+  gemmini_flush(0);
+
+  enum tiled_matmul_type_t tiled_matmul_type = WS;
+
+  printf("=== ResNet50 Conv1 - Gemmini C Reference ===\n");
+  printf("Input: %d x %d x %d x %d\n",
+         conv_1_params.batch_size,
+         conv_1_params.in_row_dim,
+         conv_1_params.in_col_dim,
+         conv_1_params.in_channels);
+  printf("Kernel: %d x %d, stride=%d, padding=%d\n",
+         conv_1_params.kernel_size, conv_1_params.kernel_size,
+         conv_1_params.stride, conv_1_params.padding);
+  printf("Output (before pool): %d x %d x %d x %d\n",
+         conv_1_params.batch_size,
+         conv_1_params.out_row_dim, conv_1_params.out_col_dim,
+         conv_1_params.out_channels);
+  printf("Pool: %d x %d, stride=%d, padding=%d\n",
+         conv_1_params.pool_size, conv_1_params.pool_size,
+         conv_1_params.pool_stride, conv_1_params.pool_padding);
+  printf("Output (after pool): %d x %d x %d x %d\n",
+         conv_1_params.batch_size,
+         conv_1_params.out_dim_pooled, conv_1_params.out_dim_pooled,
+         conv_1_params.out_channels);
+
+  // Compute input checksum for verification
+  long long input_checksum = 0;
+  const elem_t *input_ptr = &images[0][0][0][0];
+  int input_elems = conv_1_params.batch_size * conv_1_params.in_row_dim *
+                    conv_1_params.in_col_dim * conv_1_params.in_channels;
+  for (int i = 0; i < input_elems; i++) {
+    input_checksum += input_ptr[i];
+  }
+  printf("Input checksum: %lld\n", input_checksum);
+
+  // Compute weight checksum
+  long long weight_checksum = 0;
+  const elem_t *weight_ptr = &conv_1_w[0][0];
+  int weight_elems = conv_1_params.patch_size * conv_1_params.out_channels;
+  for (int i = 0; i < weight_elems; i++) {
+    weight_checksum += weight_ptr[i];
+  }
+  printf("Weight checksum: %lld\n", weight_checksum);
+
+  // Compute bias checksum
+  long long bias_checksum = 0;
+  for (int i = 0; i < conv_1_params.out_channels; i++) {
+    bias_checksum += conv_1_b[i];
+  }
+  printf("Bias checksum: %lld\n", bias_checksum);
+
+  // Run conv_1 with tiled_conv_auto (fused conv + pool)
+  uint64_t start = read_cycles();
+
+  tiled_conv_auto(
+      conv_1_params.batch_size,
+      conv_1_params.in_row_dim, conv_1_params.in_col_dim,
+      conv_1_params.in_channels,
+      conv_1_params.out_channels,
+      conv_1_params.out_row_dim, conv_1_params.out_col_dim,
+      conv_1_params.stride,
+      1, // input_dilation
+      1, // kernel_dilation
+      conv_1_params.padding,
+      conv_1_params.kernel_size,
+      false, false, false, false, false, // transposes
+      (elem_t*)images,
+      (elem_t*)conv_1_w,
+      (acc_t*)conv_1_b,
+      (elem_t*)conv_1_out_pooled,
+      RELU,
+      conv_1_params.output_scale,
+      conv_1_params.pool_size,
+      conv_1_params.pool_stride,
+      conv_1_params.pool_padding,
+      tiled_matmul_type);
+
+  gemmini_fence();
+  uint64_t end = read_cycles();
+
+  printf("Conv1 cycles: %llu\n", (unsigned long long)(end - start));
+
+  // Compute output checksum
+  long long output_checksum = 0;
+  int output_elems = conv_1_params.batch_size *
+                     conv_1_params.out_dim_pooled *
+                     conv_1_params.out_dim_pooled *
+                     conv_1_params.out_channels;
+  const elem_t *output_ptr = &conv_1_out_pooled[0][0][0][0];
+  for (int i = 0; i < output_elems; i++) {
+    output_checksum += output_ptr[i];
+  }
+  printf("Output checksum: %lld\n", output_checksum);
+  printf("Output elements: %d\n", output_elems);
+
+  // Print a few output values for debugging
+  printf("First 10 output values: ");
+  for (int i = 0; i < 10; i++) {
+    printf("%d ", output_ptr[i]);
+  }
+  printf("\n");
+
+  printf("=== Conv1 Gemmini C Reference PASS ===\n");
+
+  return 0;
+}
diff --git a/experiments/buddy-benchmarks/scripts/run_benchmark.sh b/experiments/buddy-benchmarks/scripts/run_benchmark.sh
new file mode 100755
index 0000000..13e8b9d
--- /dev/null
+++ b/experiments/buddy-benchmarks/scripts/run_benchmark.sh
@@ -0,0 +1,165 @@
+#!/usr/bin/env bash
+# run_benchmark.sh - Build and run all Buddy-MLIR Gemmini benchmarks on Spike
+#
+# Usage: ./scripts/run_benchmark.sh
+#
+# Prerequisites:
+#   - RISCV, BUDDY, SPIKE env vars set (or defaults in Makefiles)
+#   - gemmini-rocc-tests available at expected path
+
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+ROOT_DIR="$(dirname "$SCRIPT_DIR")"
+
+PASS=0
+FAIL=0
+TOTAL=0
+
+# Colors for output
+RED='\033[0;31m'
+GREEN='\033[0;32m'
+BOLD='\033[1m'
+NC='\033[0m' # No Color
+
+log_header() {
+  echo ""
+  echo -e "${BOLD}========================================${NC}"
+  echo -e "${BOLD}  $1${NC}"
+  echo -e "${BOLD}========================================${NC}"
+  echo ""
+}
+
+log_result() {
+  local name="$1"
+  local status="$2"
+  local cycles="$3"
+  local checksum="$4"
+
+  TOTAL=$((TOTAL + 1))
+  if [ "$status" = "PASS" ]; then
+    PASS=$((PASS + 1))
+    echo -e "  ${GREEN}[PASS]${NC} $name cycles=$cycles checksum=$checksum"
+  else
+    FAIL=$((FAIL + 1))
+    echo -e "  ${RED}[FAIL]${NC} $name cycles=$cycles checksum=$checksum"
+  fi
+}
+
+# ============================================================
+# Step 1: Build kernel benchmarks
+# ============================================================
+log_header "Building kernel benchmarks"
+
+cd "$ROOT_DIR/kernels"
+make clean 2>/dev/null || true
+make all 2>&1 | tail -5
+echo "Kernel benchmarks built."
+
+# ============================================================
+# Step 2: Run kernel benchmarks on Spike
+# ============================================================
+log_header "Running kernel benchmarks on Spike"
+
+SPIKE="${SPIKE:-${RISCV:-/home/eecs/ashvin.verma/toolchains/riscv}/bin/spike}"
+
+declare -A EXPECTED_CHECKSUMS
+EXPECTED_CHECKSUMS[conv]=950
+EXPECTED_CHECKSUMS[conv-with-pool]=30827
+EXPECTED_CHECKSUMS[mlp2]=252338
+EXPECTED_CHECKSUMS[mlp2-os]=252338
+EXPECTED_CHECKSUMS[mlp1]=258664
+EXPECTED_CHECKSUMS[softmax-matmul]=3860
+EXPECTED_CHECKSUMS[igelu-matmul]=-23260
+
+for bench in conv conv-with-pool mlp2 mlp2-os mlp1 softmax-matmul igelu-matmul; do
+  if [ ! -f "${bench}-baremetal" ]; then
+    echo -e "  ${RED}[SKIP]${NC} $bench - binary not found"
+    continue
+  fi
+
+  OUTPUT=$($SPIKE --extension=gemmini "${bench}-baremetal" 2>&1) || true
+
+  # Extract cycles (look for "cycles:" in output)
+  CYCLES=$(echo "$OUTPUT" | grep -i 'cycles:' | grep -oP '\d+' | tail -1 || echo "N/A")
+
+  # Extract checksum (look for "output checksum:" in output)
+  CHECKSUM=$(echo "$OUTPUT" | grep -i 'output checksum:' | grep -oP '[-]?\d+' | tail -1 || echo "N/A")
+
+  EXPECTED="${EXPECTED_CHECKSUMS[$bench]:-UNKNOWN}"
+  if [ "$CHECKSUM" = "$EXPECTED" ]; then
+    log_result "$bench" "PASS" "$CYCLES" "$CHECKSUM"
+  else
+    log_result "$bench" "FAIL" "$CYCLES" "$CHECKSUM (expected $EXPECTED)"
+  fi
+done
+
+# ============================================================
+# Step 3: Build and run ResNet50 validation
+# ============================================================
+log_header "Building ResNet50 validation"
+
+cd "$ROOT_DIR/resnet50"
+make clean 2>/dev/null || true
+make all 2>&1 | tail -5
+echo "ResNet50 benchmarks built."
+
+log_header "Running ResNet50 validation on Spike"
+
+# Initialize so the later comparisons are safe under `set -u`
+# even if the reference binary is missing
+GEMMINI_CHECKSUM="N/A"
+
+# Run Gemmini C reference
+if [ -f "conv1-gemmini-baremetal" ]; then
+  OUTPUT=$($SPIKE --extension=gemmini conv1-gemmini-baremetal 2>&1) || true
+  GEMMINI_CYCLES=$(echo "$OUTPUT" | grep -i 'Conv1 cycles:' | grep -oP '\d+' | tail -1 || echo "N/A")
+  GEMMINI_CHECKSUM=$(echo "$OUTPUT" | grep -i 'Output checksum:' | grep -oP '[-]?\d+' | tail -1 || echo "N/A")
+  echo "  Gemmini C: cycles=$GEMMINI_CYCLES checksum=$GEMMINI_CHECKSUM"
+fi
+
+# Run Buddy
+if [ -f "conv1-buddy-baremetal" ]; then
+  OUTPUT=$($SPIKE --extension=gemmini conv1-buddy-baremetal 2>&1) || true
+  BUDDY_CYCLES=$(echo "$OUTPUT" | grep -i 'conv1 cycles:' | grep -oP '\d+' | tail -1 || echo "N/A")
+  BUDDY_CHECKSUM=$(echo "$OUTPUT" | grep -i 'Output checksum:' | grep -oP '[-]?\d+' | tail -1 || echo "N/A")
+
+  if [ "$BUDDY_CHECKSUM" = "$GEMMINI_CHECKSUM" ]; then
+    log_result "resnet50-conv1 (buddy)" "PASS" "$BUDDY_CYCLES" "$BUDDY_CHECKSUM"
+  else
+    log_result "resnet50-conv1 (buddy)" "FAIL" "$BUDDY_CYCLES" "$BUDDY_CHECKSUM (expected $GEMMINI_CHECKSUM)"
+  fi
+fi
+
+# Run BAD test (should NOT match)
+if [ -f "conv1-bad-buddy-baremetal" ]; then
+  OUTPUT=$($SPIKE --extension=gemmini conv1-bad-buddy-baremetal 2>&1) || true
+  BAD_CHECKSUM=$(echo "$OUTPUT" | grep -i 'Output checksum:' | grep -oP '[-]?\d+' | tail -1 || echo "N/A")
+
+  TOTAL=$((TOTAL + 1))
+  if [ "$BAD_CHECKSUM" != "$GEMMINI_CHECKSUM" ]; then
+    PASS=$((PASS + 1))
+    echo -e "  ${GREEN}[PASS]${NC} resnet50-conv1 (bad) correctly differs: checksum=$BAD_CHECKSUM"
+  else
+    FAIL=$((FAIL + 1))
+    echo -e "  ${RED}[FAIL]${NC} resnet50-conv1 (bad) unexpectedly matches reference!"
+  fi
+fi
+
+# ============================================================
+# Summary
+# ============================================================
+log_header "Summary"
+
+echo "  Total tests: $TOTAL"
+echo -e "  Passed: ${GREEN}$PASS${NC}"
+if [ "$FAIL" -gt 0 ]; then
+  echo -e "  Failed: ${RED}$FAIL${NC}"
+else
+  echo -e "  Failed: $FAIL"
+fi
+echo ""
+
+if [ "$FAIL" -gt 0 ]; then
+  echo -e "${RED}Some tests failed!${NC}"
+  exit 1
+else
+  echo -e "${GREEN}All tests passed.${NC}"
+  exit 0
+fi