Conversation

@yy214123
Contributor

@yy214123 yy214123 commented Oct 29, 2025

This improves the instruction-fetch path by adding new cache structures and enhancing the observability of fetch behavior.

Key Improvements

  1. Replaces the previous 1-entry direct-mapped design with a 2-entry hash-indexed TLB.
  2. Adds a direct-mapped instruction cache that stores recently fetched instructions and provides a fast hit path before triggering address translation.
  3. Provides finer-grained fetch statistics to support and validate the above architectural changes.

Here’s the updated performance comparison:

|  | 1-entry TLB | 2-entry TLB | + 64KB Direct-Mapped I-cache |
| --- | --- | --- | --- |
| Fetch hit | 95,971,425 | 96,757,879 | 97,706,775 |
| Fetch miss | 4,028,576 | 3,242,122 (-19.5%) | 1,026,354 (-68%) |
| Fetch hit rate | 95.97% | 96.76% | 98.96% |
| Load hit rate | 93.15% | 93.15% | 93.15% |
| Store hit rate | 95.57% | 95.57% | 95.57% |

@yy214123 yy214123 marked this pull request as draft October 29, 2025 18:56
@yy214123 yy214123 force-pushed the direct-mapped-cache branch 2 times, most recently from 98114a7 to 0e4f67b Compare October 30, 2025 05:32
@yy214123 yy214123 force-pushed the direct-mapped-cache branch 2 times, most recently from bb9e6cb to 74e3b99 Compare October 30, 2025 19:24
@yy214123 yy214123 closed this Oct 31, 2025
@yy214123 yy214123 reopened this Oct 31, 2025
@yy214123 yy214123 force-pushed the direct-mapped-cache branch from 686cede to 5478710 Compare November 1, 2025 06:40
@yy214123 yy214123 requested a review from jserv November 2, 2025 09:08
Collaborator

@jserv jserv left a comment


Unify the naming scheme.

@yy214123 yy214123 force-pushed the direct-mapped-cache branch from fe2d95e to f657fb2 Compare November 4, 2025 16:55
Collaborator

@jserv jserv left a comment


Rebase the latest 'master' branch and resolve build errors.

@yy214123 yy214123 force-pushed the direct-mapped-cache branch from aab465c to db3a37f Compare November 4, 2025 18:20
@sysprog21 sysprog21 deleted a comment from yy214123 Nov 4, 2025
jserv

This comment was marked as outdated.

Collaborator

@visitorckw visitorckw left a comment


This series appears to contain several "fix-up," "refactor," or "build-fix" commits that correct or adjust a preceding patch.

To maintain a clean history and ensure the project is bisectable, each patch in a series should be complete and correct on its own.

@visitorckw
Collaborator

As a friendly reminder regarding project communication:

Please ensure that when you quote-reply to others' comments, you do not translate the quoted text into any language other than English.

This is an open-source project, and it's important that we keep all discussions in English. This ensures that the conversation remains accessible to everyone in the community, including current and future participants who may not be familiar with other languages.

riscv.c (outdated)

```c
icache_block_t tmp = *blk;
*blk = *vblk;
*vblk = tmp;
blk->tag = tag;
```
Collaborator


This code looks suspicious to me.

When you move the evicted I-cache block (tmp) back into the victim cache, you are setting the vblk->tag to tmp.tag, which is the 16-bit I-cache tag.

Won't this corrupt the victim cache entry? The VC search logic requires a 24-bit tag ([ICache Tag | ICache Index]) to function. Because you're only storing the 16-bit tag, this VCache entry will never be hit again.

Contributor Author


Won't this corrupt the victim cache entry? The VC search logic requires a 24-bit tag ([ICache Tag | ICache Index]) to function. Because you're only storing the 16-bit tag, this VCache entry will never be hit again.

Thank you for pointing that out. I've added the following expression to ensure correctness:

```diff
+   vblk->tag = (tmp.tag << ICACHE_INDEX_BITS) | idx;
```

@yy214123 yy214123 force-pushed the direct-mapped-cache branch from db3a37f to c6485fc Compare November 6, 2025 09:56
@sysprog21 sysprog21 deleted a comment from cubic-dev-ai bot Nov 12, 2025
@sysprog21 sysprog21 deleted a comment from cubic-dev-ai bot Nov 12, 2025
@jserv
Collaborator

jserv commented Nov 12, 2025

Found that the victim cache implementation still has significant room for improvement:

How do you plan to proceed?

@yy214123
Contributor Author

How do you plan to proceed?

Not sure yet if it’s due to the FIFO policy or the victim cache size.
I’ll try a simple approximate LRU version (with a counter) to see how much the replacement policy affects the stats.

Does this approach make sense to you, or would you suggest a different direction?

@yy214123 yy214123 force-pushed the direct-mapped-cache branch from bc84f9b to 40ce39a Compare November 13, 2025 15:53
@jserv
Collaborator

jserv commented Nov 13, 2025

Does this approach make sense to you, or would you suggest a different direction?

When in doubt, measure before you choose a method.

@yy214123
Contributor Author

yy214123 commented Nov 13, 2025

When in doubt, measure before you choose a method.

I ran some benchmarks on this new LRU setup, with all cache statistics generated after executing 100 million instructions to ensure a fair comparison. Here's what I found.

With the same 256 I-Cache / 16 V-Cache blocks, switching from FIFO to LRU gave only a slight hit-rate boost:

|  | LRU | FIFO |
| --- | --- | --- |
| Victim cache hit rate | 4.16% | 4.10% |
| Victim cache miss rate | 95.84% | 95.90% |

I then doubled the V-Cache (16 -> 32 blocks), but the miss rate remained very high:

|  | LRU | FIFO |
| --- | --- | --- |
| Victim cache hit rate | 5.74% | 5.68% |
| Victim cache miss rate | 94.26% | 94.32% |

This means the misses reaching the V-Cache are almost all compulsory misses, not conflict misses. To prove this, I intentionally shrank the I-Cache (256 -> 64 blocks) to force more conflicts.

|  | LRU | FIFO |
| --- | --- | --- |
| Victim cache hit rate | 11.85% | 11.64% |
| Victim cache miss rate | 88.15% | 88.36% |

The LRU logic is working fine. The key takeaway is that compared to FIFO, we did see a tiny improvement, which confirms the LRU strategy successfully handled the few conflict misses that were present.

@jserv
Collaborator

jserv commented Nov 14, 2025

The measurements reveal that 98% of the benefit comes from a simple 2-entry TLB cache, not from the elaborate 64KB I-cache + 16-block victim cache with LRU. The working set fits comfortably in 64KB. The victim cache is solving a problem that doesn't exist. These are compulsory misses, not conflict misses.

LRU Overhead Cost:

```c
/* O(N) linear scan on EVERY I-cache miss (6.19M times),
 * 16 comparisons each: */
for (int i = 1; i < 16; i++) {
    if (vm->icache.v_used[i] < lru_min) { ... }
}
```

- 6.19M misses × 16 comparisons ≈ 99M operations
- Saves only 405K victim cache hits


```c
/* fill into the icache */
uint32_t block_off = (addr & RV_PAGE_MASK) & ~ICACHE_BLOCK_MASK;
blk->base = (const uint8_t *) vm->cache_fetch[index].page_addr + block_off;
```
Collaborator


Pointer aliasing time bomb:

  • I-cache stores blk->base pointing to physical memory
  • TLB stores page_addr pointing to same memory
  • These can become desynchronized on page remapping → potential use-after-free

Contributor Author


Does this mean that I also need to add I-cache and victim-cache invalidation inside the mmu_invalidate_range?

Contributor Author


I have added the relevant code in mmu_invalidate_range:

```c
for (int i = 0; i < ICACHE_BLOCKS; i++) {
    icache_block_t *blk = &vm->icache.block[i];
    if (!blk->valid)
        continue;

    /* Reconstruct the block's virtual page number from its tag and index */
    uint32_t icache_vpn = (blk->tag << ICACHE_INDEX_BITS) | i;
    icache_vpn >>= (RV_PAGE_SHIFT - ICACHE_OFFSET_BITS);
    /* Drop any block whose page falls inside the invalidated range */
    if (icache_vpn >= start_vpn && icache_vpn <= end_vpn)
        blk->valid = false;
}
```

Replace the previous 1-entry direct-mapped design with a 2-entry
direct-mapped cache using hash-based indexing (same parity hash as
cache_load). This allows two hot virtual pages to coexist without
thrashing.

Measurement shows that the number of virtual-to-physical translations
during instruction fetch (mmu_translate() calls) decreased by ~19%.

Extend the existing architecture to include a direct-mapped
instruction cache that stores recently fetched instructions.

Measurement shows that the number of virtual-to-physical translations
during instruction fetch (mmu_translate() calls) decreased by ~65%.
@yy214123 yy214123 force-pushed the direct-mapped-cache branch from 40ce39a to 95693bb Compare November 18, 2025 16:02
@yy214123
Contributor Author

After discussing with @jserv , we decided to remove the victim cache since the benefit was limited and the implementation added unnecessary complexity.
I also reordered the commits to ensure the 2-entry TLB results aren’t affected by the I-cache changes.

Here’s the updated performance comparison:

|  | 1-entry TLB | 2-entry TLB | + 64KB Direct-Mapped I-cache |
| --- | --- | --- | --- |
| Fetch hit | 95,971,425 | 96,757,879 | 97,706,775 |
| Fetch miss | 4,028,576 | 3,242,122 (-19.5%) | 1,026,354 (-68%) |
| Fetch hit rate | 95.97% | 96.76% | 98.96% |
| Load hit rate | 93.15% | 93.15% | 93.15% |
| Store hit rate | 95.57% | 95.57% | 95.57% |

@jserv
Collaborator

jserv commented Nov 19, 2025

we decided to remove the victim cache since the benefit was limited and the implementation added unnecessary complexity. I also reordered the commits to ensure the 2-entry TLB results aren’t affected by the I-cache changes.

Update the descriptions of this pull request.

@yy214123 yy214123 force-pushed the direct-mapped-cache branch from 95693bb to 27608c7 Compare November 19, 2025 11:13
@yy214123 yy214123 changed the title Implement direct mapped cache for instruction fetch Optimize the instruction fetch path Nov 19, 2025
@yy214123
Contributor Author

Update the descriptions of this pull request.

I have updated the title and description at the top.

@jserv
Collaborator

jserv commented Nov 24, 2025

Is this pull request ready to proceed?

@yy214123
Contributor Author

Is this pull request ready to proceed?

Yes. I am currently keeping only the simple implementations that bring effective improvements.

@yy214123 yy214123 marked this pull request as ready for review November 30, 2025 10:28

@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 3 files


<file name="main.c">

<violation number="1" location="main.c:1067">
P3: Per-hart cache statistics no longer include the hart index, so multi-hart output is ambiguous and you can’t tell which hart each block refers to.</violation>
</file>


```c
                    100.0 * hart->cache_fetch.hits / fetch_total);
        fprintf(stderr, "\n");

        fprintf(stderr, "\n=== Introduction Cache Statistics ===\n");
```

@cubic-dev-ai cubic-dev-ai bot Nov 30, 2025


P3: Per-hart cache statistics no longer include the hart index, so multi-hart output is ambiguous and you can’t tell which hart each block refers to.


<file context>

```diff
@@ -1047,14 +1063,25 @@ static void print_mmu_cache_stats(vm_t *vm)
-                    100.0 * hart->cache_fetch.hits / fetch_total);
-        fprintf(stderr, "\n");

+        fprintf(stderr, "\n=== Introduction Cache Statistics ===\n");
+        fprintf(stderr, "  Total access:  %12llu\n", access_total);
+        if (access_total > 0) {
```

</file context>
Suggested change:

```diff
-fprintf(stderr, "\n=== Introduction Cache Statistics ===\n");
+fprintf(stderr, "\nHart %u:\n=== Instruction Cache Statistics ===\n", i);
```

Contributor Author


I have added the relevant code.
After building with `make SMP=2 check`, the following statistics are output:

```
=== MMU Cache Statistics ===

Hart 0:

=== Introduction Cache Statistics ===
  Total access:     358547968
  Icache hits:      349789985 (97.56%)
  Icache misses:      8757983 (2.44%)
   ├ TLB hits:        4108502 (46.91%)
   └ TLB misses:      4649481 (53.09%)

=== Data Cache Statistics ===
  Load:      74468422 hits,      4645526 misses (8x2) (94.13% hit rate)
  Store:     60833790 hits,      1337401 misses (8x2) (97.85% hit rate)

Hart 1:

=== Introduction Cache Statistics ===
  Total access:     313707200
  Icache hits:      305442748 (97.37%)
  Icache misses:      8264452 (2.63%)
   ├ TLB hits:        3563779 (43.12%)
   └ TLB misses:      4700673 (56.88%)

=== Data Cache Statistics ===
  Load:      68932725 hits,      3193502 misses (8x2) (95.57% hit rate)
  Store:     50140013 hits,       754139 misses (8x2) (98.52% hit rate)
```

Introduce detailed metrics for total fetches, icache hits/misses, and
TLB hits/misses to replace the old aggregated MMU stats. This provides
more accurate profiling and clearer insight into instruction fetch
behavior.
@yy214123 yy214123 force-pushed the direct-mapped-cache branch from 27608c7 to 7658a3f Compare November 30, 2025 15:43
@yy214123 yy214123 requested a review from jserv December 2, 2025 06:04
@jserv jserv merged commit 10cdcbc into sysprog21:master Dec 4, 2025
9 of 10 checks passed
@jserv
Collaborator

jserv commented Dec 4, 2025

Thank @yy214123 for contributing!
