Optimize the instruction fetch path #103
Conversation
Force-pushed from 98114a7 to 0e4f67b
Force-pushed from bb9e6cb to 74e3b99
Force-pushed from 686cede to 5478710
jserv left a comment:
Unify the naming scheme.
Force-pushed from fe2d95e to f657fb2
jserv left a comment:
Rebase the latest 'master' branch and resolve build errors.
Force-pushed from aab465c to db3a37f
visitorckw left a comment:
This series appears to contain several "fix-up," "refactor," or "build-fix" commits that correct or adjust a preceding patch.
To maintain a clean history and ensure the project is bisectable, each patch in a series should be complete and correct on its own.
As a friendly reminder regarding project communication: when you quote-reply to others' comments, please do not translate the quoted text into any language other than English. This is an open-source project, and keeping all discussions in English ensures the conversation remains accessible to everyone in the community, including current and future participants who may not be familiar with other languages.
riscv.c (outdated):

```c
icache_block_t tmp = *blk; /* save the evicted I-cache block */
*blk = *vblk;              /* promote the victim-cache block into the I-cache */
*vblk = tmp;               /* demote the evicted block into the victim cache */
blk->tag = tag;
```
This code looks suspicious to me.
When you move the evicted I-cache block (tmp) back into the victim cache, you are setting vblk->tag to tmp.tag, which is the 16-bit I-cache tag.
Won't this corrupt the victim cache entry? The VC search logic requires a 24-bit tag ([ICache Tag | ICache Index]) to function. Because you're only storing the 16-bit tag, this VCache entry will never be hit again.
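For reference, a minimal sketch of the address split implied here, assuming a 32-bit fetch address with a 16-bit tag, an 8-bit index (256 blocks), and an 8-bit block offset; the helper name is hypothetical:

```c
#include <stdint.h>

/* Assumed layout (widths consistent with the figures in this thread):
 *   fetch address = [ tag (16) | index (8) | offset (8) ]
 * A victim-cache entry must keep both tag AND index, i.e. the full
 * 24-bit block number, or it can never match on a later VC search. */
#define ICACHE_OFFSET_BITS 8
#define ICACHE_INDEX_BITS 8

static inline uint32_t vcache_tag(uint32_t icache_tag, uint32_t idx)
{
    /* 24-bit VC tag: [ICache Tag | ICache Index] */
    return (icache_tag << ICACHE_INDEX_BITS) | idx;
}
```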
> Won't this corrupt the victim cache entry? The VC search logic requires a 24-bit tag ([ICache Tag | ICache Index]) to function. Because you're only storing the 16-bit tag, this VCache entry will never be hit again.
Thank you for pointing that out. I've added the following expression to ensure correctness:

```c
vblk->tag = (tmp.tag << ICACHE_INDEX_BITS) | idx;
```

Force-pushed from db3a37f to c6485fc
How do you plan to proceed?
I'm not sure yet whether it's due to the FIFO policy or the victim cache size. Does this approach make sense to you, or would you suggest a different direction?
Force-pushed from bc84f9b to 40ce39a
When in doubt, measure before you choose a method.
I ran some benchmarks on this new LRU setup, with all cache statistics generated after executing 100 million instructions to ensure a fair comparison. Here's what I found. With the same 256 I-Cache / 16 V-Cache blocks, switching from FIFO to LRU gave only a slight hit-rate boost:
I then doubled the V-Cache (16 -> 32 blocks), but the miss rate was still really high:
This means the misses hitting the V-Cache are almost all compulsory misses, not conflict misses. To prove this, I intentionally shrank the I-Cache (256 -> 64 blocks) to force more conflicts.
The LRU logic is working fine. The key takeaway is that, compared to FIFO, we did see a tiny improvement, which confirms the LRU strategy successfully handled the few conflict misses that were present.
The measurements reveal that 98% of the benefit comes from a simple 2-entry TLB cache, not from the elaborate 64KB I-cache + 16-block victim cache with LRU. The working set fits comfortably in 64KB. The victim cache is solving a problem that doesn't exist: these are compulsory misses, not conflict misses.
LRU Overhead Cost:
```c
/* fill into the icache */
uint32_t block_off = (addr & RV_PAGE_MASK) & ~ICACHE_BLOCK_MASK;
blk->base = (const uint8_t *) vm->cache_fetch[index].page_addr + block_off;
```
Pointer aliasing time bomb:
- I-cache stores blk->base pointing to physical memory
- TLB stores page_addr pointing to same memory
- These can become desynchronized on page remapping → potential use-after-free
Does this mean that I also need to add I-cache and victim-cache invalidation inside mmu_invalidate_range()?
I have added the relevant code in mmu_invalidate_range:

```c
for (int i = 0; i < ICACHE_BLOCKS; i++) {
    icache_block_t *blk = &vm->icache.block[i];
    if (!blk->valid)
        continue;
    /* reconstruct the block's virtual page number from its tag and index */
    uint32_t icache_vpn = (blk->tag << ICACHE_INDEX_BITS) | i;
    icache_vpn >>= (RV_PAGE_SHIFT - ICACHE_OFFSET_BITS);
    if (icache_vpn >= start_vpn && icache_vpn <= end_vpn)
        blk->valid = false;
}
```

Replace the previous 1-entry direct-mapped design with a 2-entry direct-mapped cache using hash-based indexing (same parity hash as cache_load). This allows two hot virtual pages to coexist without thrashing. Measurement shows that the number of virtual-to-physical translations during instruction fetch (mmu_translate() calls) decreased by ~19%.
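A rough sketch of that parity-hash scheme follows. This is an illustration under assumptions: the entry layout, helper names, and the exact hash are inferred from the descriptions in this thread, not taken from the project's code.

```c
#include <stdbool.h>
#include <stdint.h>

#define RV_PAGE_SHIFT 12

/* Illustrative 2-entry fetch cache; field names are assumptions. */
typedef struct {
    uint32_t vpn;             /* cached virtual page number */
    const uint8_t *page_addr; /* host pointer to the translated page */
    bool valid;
} fetch_entry_t;

static fetch_entry_t cache_fetch[2];

/* Fold the VPN down to one parity bit so two hot pages whose VPNs
 * differ in parity land in different entries instead of thrashing. */
static inline uint32_t parity_hash(uint32_t vpn)
{
    vpn ^= vpn >> 16;
    vpn ^= vpn >> 8;
    vpn ^= vpn >> 4;
    vpn ^= vpn >> 2;
    vpn ^= vpn >> 1;
    return vpn & 1;
}

/* On a miss, fall back to the full MMU walk (passed in as a callback
 * here to avoid guessing the project's mmu_translate() signature). */
const uint8_t *fetch_translate(uint32_t addr,
                               const uint8_t *(*mmu_walk)(uint32_t))
{
    uint32_t vpn = addr >> RV_PAGE_SHIFT;
    fetch_entry_t *e = &cache_fetch[parity_hash(vpn)];
    if (!e->valid || e->vpn != vpn) {
        e->page_addr = mmu_walk(addr & ~((1u << RV_PAGE_SHIFT) - 1));
        e->vpn = vpn;
        e->valid = true;
    }
    return e->page_addr + (addr & ((1u << RV_PAGE_SHIFT) - 1));
}
```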
Extend the existing architecture to include a direct-mapped instruction cache that stores recently fetched instructions. Measurement shows that the number of virtual-to-physical translations during instruction fetch (mmu_translate() calls) decreased by ~65%.
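As a rough sketch of such a direct-mapped lookup, assuming the 16/8/8 tag/index/offset split discussed earlier (types and names mirror the snippets in this thread but are illustrative, not the actual implementation):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ICACHE_OFFSET_BITS 8
#define ICACHE_INDEX_BITS 8
#define ICACHE_BLOCKS (1u << ICACHE_INDEX_BITS)
#define ICACHE_BLOCK_MASK ((1u << ICACHE_OFFSET_BITS) - 1)

typedef struct {
    uint32_t tag;        /* upper address bits identifying the block */
    const uint8_t *base; /* host pointer to the cached block */
    bool valid;
} icache_block_t;

static icache_block_t icache[ICACHE_BLOCKS];

/* Return a host pointer for the fetch address on a hit, NULL on a miss
 * (the caller then refills the block via the fetch TLB / MMU walk). */
static inline const uint8_t *icache_lookup(uint32_t addr)
{
    uint32_t idx = (addr >> ICACHE_OFFSET_BITS) & (ICACHE_BLOCKS - 1);
    uint32_t tag = addr >> (ICACHE_OFFSET_BITS + ICACHE_INDEX_BITS);
    icache_block_t *blk = &icache[idx];
    if (blk->valid && blk->tag == tag)
        return blk->base + (addr & ICACHE_BLOCK_MASK);
    return NULL;
}
```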
Force-pushed from 40ce39a to 95693bb
After discussing with @jserv, we decided to remove the victim cache since the benefit was limited and the implementation added unnecessary complexity. Here's the updated performance comparison:
Update the descriptions of this pull request.
Force-pushed from 95693bb to 27608c7
I have updated the title and description at the top.
Is this pull request ready to proceed?
Yes, I am keeping only the simple implementations that bring effective improvements.
cubic-dev-ai left a comment:

1 issue found across 3 files.
main.c (print_mmu_cache_stats):

```diff
@@ -1047,14 +1063,25 @@ static void print_mmu_cache_stats(vm_t *vm)
-                 100.0 * hart->cache_fetch.hits / fetch_total);
-    fprintf(stderr, "\n");
+    fprintf(stderr, "\n=== Introduction Cache Statistics ===\n");
+    fprintf(stderr, "  Total access: %12llu\n", access_total);
+    if (access_total > 0) {
```
P3: Per-hart cache statistics no longer include the hart index, so multi-hart output is ambiguous and you can’t tell which hart each block refers to.
Suggested change:

```diff
-    fprintf(stderr, "\n=== Introduction Cache Statistics ===\n");
+    fprintf(stderr, "\nHart %u:\n=== Instruction Cache Statistics ===\n", i);
```
I have added the relevant code.
After running `make SMP=2 check`, the following statistics are output:

```
=== MMU Cache Statistics ===
Hart 0:
=== Introduction Cache Statistics ===
Total access: 358547968
Icache hits: 349789985 (97.56%)
Icache misses: 8757983 (2.44%)
├ TLB hits: 4108502 (46.91%)
└ TLB misses: 4649481 (53.09%)
=== Data Cache Statistics ===
Load: 74468422 hits, 4645526 misses (8x2) (94.13% hit rate)
Store: 60833790 hits, 1337401 misses (8x2) (97.85% hit rate)
Hart 1:
=== Introduction Cache Statistics ===
Total access: 313707200
Icache hits: 305442748 (97.37%)
Icache misses: 8264452 (2.63%)
├ TLB hits: 3563779 (43.12%)
└ TLB misses: 4700673 (56.88%)
=== Data Cache Statistics ===
Load: 68932725 hits, 3193502 misses (8x2) (95.57% hit rate)
Store: 50140013 hits, 754139 misses (8x2) (98.52% hit rate)
```

Introduce detailed metrics for total fetches, icache hits/misses, and TLB hits/misses to replace the old aggregated MMU stats. This provides more accurate profiling and clearer insight into instruction fetch behavior.
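For illustration, one plausible counter layout and printer behind numbers like these; the struct and field names are assumptions, not the project's actual code:

```c
#include <stdio.h>

/* Hypothetical per-hart fetch counters matching the report above. */
typedef struct {
    unsigned long long access_total; /* every instruction fetch */
    unsigned long long icache_hits;  /* served straight from the I-cache */
    unsigned long long tlb_hits;     /* I-cache miss, fetch-TLB hit */
    unsigned long long tlb_misses;   /* I-cache miss, full MMU walk */
} fetch_stats_t;

static void print_fetch_stats(unsigned hart_id, const fetch_stats_t *s)
{
    unsigned long long misses = s->tlb_hits + s->tlb_misses;
    fprintf(stderr, "Hart %u:\n", hart_id);
    fprintf(stderr, "  Total access: %12llu\n", s->access_total);
    if (s->access_total > 0) {
        fprintf(stderr, "  Icache hits:  %12llu (%.2f%%)\n", s->icache_hits,
                100.0 * s->icache_hits / s->access_total);
        fprintf(stderr, "  Icache misses:%12llu (%.2f%%)\n", misses,
                100.0 * misses / s->access_total);
    }
}
```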
Force-pushed from 27608c7 to 7658a3f
Thank @yy214123 for contributing!
This pull request improves the instruction-fetch path by adding new cache structures and enhancing the observability of fetch behavior.
Key Improvements
Here’s the updated performance comparison: