-
Notifications
You must be signed in to change notification settings - Fork 15
RHX Upstream patches #188
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: trunk
Are you sure you want to change the base?
RHX Upstream patches #188
Conversation
This commit introduces a new -mtune=rmx100 tuning option together with relevant scheduler definitions. Instruction latencies and costs are based on the "RMX-100 Technical Reference Manual" document (revision 0.4, 13 September 2023) and are subject to change. The changes have been verified by running the Dhrystone and Coremark benchmarks and observing expected (small) improvements compared to the -mtune=generic results. Signed-off-by: Artemiy Volkov <[email protected]>
This commit adds the new arcv-mpy-option compilation parameter with the valid (string) values of 1c, 2c, and 10c. This corresponds to different versions of the MPY/DIV unit of the RMX100 core, each of which has different latencies for imul/idiv instructions. Internally, this option is propagated to the pipeline description information in rmx100.md with the use of new helper functions defined in riscv.cc. Signed-off-by: Artemiy Volkov <[email protected]>
This patch adds latencies related to FPU instructions to arcv-rmx100.md. The specific values used correspond to the 'fast' config, except fdiv where the latency was reduced to 10 cycles. In the future, FP latencies for RMX-100 should be made dependent on an external (-mfpu-like) option. Signed-off-by: Artemiy Volkov <[email protected]>
Signed-off-by: Claudiu Zissulescu <[email protected]>
For the RMX-500 and RHX cores, the sequence "load-immediate + store" (that is used to store a constant value) can be executed in 1 cycle, provided the two instructions are kept next to one another. This patch handles this case in riscv_macro_fusion_pair_p(). Signed-off-by: Artemiy Volkov <[email protected]>
ARC-V related optimisations must be guarded like:
if (riscv_microarchitecture == <arch>) {
...
}
Introduce an inline function that encapsulates this:
static inline bool riscv_is_micro_arch (<arch>)
Use it to define __riscv_rhx whenever compiling for the RHX
microarchitecture.
Signed-off-by: Shahab Vahedi <[email protected]>
With this commit, we allow a load-immediate to be macro-op fused with a successive conditional branch that is dependent on it, e.g.: li t0, #imm bge t1, t0, .label Additionally, add a new testcase to check that this fusion type is handled correctly. Signed-off-by: Artemiy Volkov <[email protected]>
To take better advantage of double load/store fusion, make use of the sched_fusion pass that assigns unique "fusion priorities" to load/store instructions and schedules operations on adjacent addresses together. This maximizes the probability that loads/stores are fused between each other instead of with other instructions. Signed-off-by: Artemiy Volkov <[email protected]>
With this patch, arcv_macro_fusion_pair_p () recognizes instruction pairs like: LOAD rd1, [rs1,offset] add/sub rd2, rs1, rs2/imm (where all regs are distinct) and: STORE rs2, [rs1,offset] add/sub rd, rs1, rs2/imm as fused macro-op pairs. In the case of a load, rd1 being equal to rd2, rs1, or rs2 would lead to data hazards, hence this is disallowed; for stores, rs1 and rs2 of the two instructions must match. Signed-off-by: Artemiy Volkov <[email protected]>
Fuse together instruction pairs such as: LOAD rd1, [rs1,offset] lui rd2, imm (where rd1 and rd2 are distinct) and: STORE rs2, [rs1,offset] lui rd, imm Signed-off-by: Artemiy Volkov <[email protected]>
The RHX core executes integer multiply-add sequences of the form:
mul r1,r2,r3
add r1,r1,r4
in 1 cycle due to macro-op fusion. This patch adds a
define_insn_and_split to recognize the above sequence and preserve it as
a single insn up until the post-reload split pass.
Since, due to a microarchitectural restriction, the output operand of
both instructions must be the same register, the insn_and_split pattern
has two alternatives corresponding to the following cases: (0) r1 is
different from r4, in which case the insn can be split to the sequence
above; (1) r1 and r4 are the same, in which case a temporary register
has to be used and there is no fusion. Alternative (1) is discouraged
so that reload maximizes the number of instances where MAC fusion can be
applied. Since RHX is a rv32im core, the pattern requires that the
target is 32-bit and supports multiplication.
In addition, the {u,}maddhisi3 expand is implemented for RHX to
convert the ( 16-bit x 16-bit + 32_bit ) WIDEN_MULT_PLUS_EXPR GIMPLE
operator to the aforementioned madd_split instruction directly.
Lastly, a very basic testcase is introduced to make sure that the
new patterns are sufficient to produce MAC-fusion-aware code.
No new dejagnu failures with RUNTESTFLAGS="CFLAGS_FOR_TARGET=-mtune=rhx
dg.exp".
Signed-off-by: Artemiy Volkov <[email protected]>
To make sure that the multiply-add pairs (split post-reload from the madd_split instruction) are not broken up by the sched2 pass, designate them as fusable in arcv_macro_fusion_pair_p (). Signed-off-by: Artemiy Volkov <[email protected]>
The bitfield zero_extract operation is normally expanded into an srai followed by an andi. (With the ZBS extension enabled, the special case of 1-bit zero-extract is implemented with the bexti insn.) However, since the RHX core can execute a shift-left and a shift-right of the same register in 1 cycle, we would prefer to emit those two instructions instead, and schedule them together so that macro fusion can take place. The required steps to achieve this are: (1) Create an insn_and_split that handles the zero_extract RTX; (2) Tell the combiner to use that split by lowering the cost of the zero_extract RTX when the target is the RHX core; (3) Designate the resulting slli + srli pair as fusable by the scheduler. Attached is a small testcase demonstrating the split, and that the bexti insn still takes priority over the shift pair. Signed-off-by: Artemiy Volkov <[email protected]>
Some fusion types (namely, LD/ST-OP/OPIMM and LD/ST-LUI) are available regardless of the order of instructions. To support this, extract the new arcv_memop_arith_pair_p () and arcv_memop_lui_pair_p () functions and call them twice. Signed-off-by: Artemiy Volkov <[email protected]>
This commit implements the scheduling model for the RHX-100 core. Among notable
things are:
(1) The arcv_macro_fusion_pair_p () hook has been modified to not create
SCHED_GROUP's larger than 2 instructions; also, it gives priority to double
load/store fusion, suppressing the other types until sched2.
(2) riscv_issue_rate () is set to 4 and the system is modeled as 4 separate
pipelines, giving access to as many instructions in ready_list as possible.
(3) The rhx.md description puts some initial constraints in place (e.g. memory
ops can only go into pipe B), saving some work in the reordering hook.
(4) The riscv_sched_variable_issue () and riscv_sched_reorder2 () hooks work
together to make sure (in order of descending priority) that:
(a) the critical path and the instruction priorities are respected;
(b) both pipes are filled (taking advantage of parallel dispatch within the
microarchitectural constraints);
(c) there is as much fusion going on as possible (and the existing fusion
pairs are not broken up).
There is probably some room for improvement, and some tweaks will probably have
to be made in response to HLA changes as the HW development process goes on.
Signed-off-by: Artemiy Volkov <[email protected]>
This patch implements riscv_sched_adjust_priority () for the RHX-100 microarchitecture by slightly bumping the priority of load/store pairs. As a consequence of this change, it becomes easier for riscv_sched_reorder2 () to schedule instructions in the memory pipe. Signed-off-by: Artemiy Volkov <[email protected]>
In addition to the LW+LW and SW+SW pairs that are already being recognized as macro-op-fusable, add support for 8-bit and naturally aligned 16-bit loads operating on adjacent memory locations. To that end, introduce the new microarch-specific pair_fusion_mode_allowed_p () predicate, and call it from fusion_load_store () during sched_fusion, and from arcv_macro_fusion_pair_p () during regular scheduling passes. Signed-off-by: Artemiy Volkov <[email protected]>
Currently on ARC-V, the maddhisi3 pattern always expands to the madd_split_fused instruction regardless of the target word size, which leads to the full-width mul and add instructions being emitted for 32-bit data even on riscv64: mul a6,a4,s6 add a6,a6,s7 sext.w s7,a6 To fix this, add another define_insn (madd_split_fused_extended) pattern wrapping the result of a MAC operation into a sign-extension from 32 to 64 bits, and use it in the (u)maddhisi3 expander in case of a 64-bit target. The assembly code after this change is more efficient, viz.: mulw a6,a4,s6 addw a6,a6,s7 Signed-off-by: Artemiy Volkov <[email protected]>
This define_insn_and_split prevents *zero_extract_fused from being selected. Updated the test. It succeeded despite the fused case not being selected because the right instructions were produced still. Signed-off-by: Michiel Derhaeg <[email protected]>
Signed-off-by: Luis Silva <[email protected]>
Align with how other RISC-V arch are checked (e.g., TARGET_SIFIVE_7, TARGET_ROCKET, TARGET_SIFIVE_P400_SERIES). Signed-off-by: Luis Silva <[email protected]>
In these cases, it makes more sense to check if ARCV fusion is enabled rather than checking if a specific mtune is enabled. Signed-off-by: Luis Silva <[email protected]>
Signed-off-by: Luis Silva <[email protected]>
Signed-off-by: Luis Silva <[email protected]>
Signed-off-by: Luis Silva <[email protected]>
Group alu_pipe_scheduled_p, pipeB_scheduled_p, last_scheduled_insn, and cached_can_issue_more into arcv_sched_state struct. Signed-off-by: Luis Silva <[email protected]>
Signed-off-by: Luis Silva <[email protected]>
Signed-off-by: Luis Silva <[email protected]>
Signed-off-by: Luis Silva <[email protected]>
Signed-off-by: Luis Silva <[email protected]>
Signed-off-by: Luis Silva <[email protected]>
Signed-off-by: Luis Silva <[email protected]>
Signed-off-by: Luis Silva <[email protected]>
Signed-off-by: Luis Silva <[email protected]>
Signed-off-by: Luis Silva <[email protected]>
gcc/config/riscv/arcv.cc
Outdated
| /* ARCV-specific macro-op fusion for RISC-V. | ||
| Copyright (C) 2025 Free Software Foundation, Inc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would describe this more general. "ARC-V-specific functions something something"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated the headers of arcv.{cc,h} to match the style used in other architecture-specific files. What do you think?
| if (REG_P (XEXP (curr_plus, 1)) | ||
| && REGNO (XEXP (curr_plus, 1)) == mult_dest_regno) | ||
| return true; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The caller adds debug logs to each of these, I think we should also do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added the respective debug logs.
Signed-off-by: Luis Silva <[email protected]>
Signed-off-by: Luis Silva <[email protected]>
Signed-off-by: Luis Silva <[email protected]>
Signed-off-by: Luis Silva <[email protected]>
Signed-off-by: Luis Silva <[email protected]>
Signed-off-by: Luis Silva <[email protected]>
Signed-off-by: Luis Silva <[email protected]>
Signed-off-by: Luis Silva <[email protected]>
Signed-off-by: Luis Silva <[email protected]>
Signed-off-by: Luis Silva <[email protected]>
gcc/config/riscv/arcv.cc
Outdated
| else if (get_attr_type (ready[*n_readyp - 1]) != TYPE_LOAD | ||
| || get_attr_type (ready[*n_readyp - 1]) != TYPE_STORE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't think this condition is correct, it is always true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Im wondering if this is what is supposed to be:
From 4f60a816bd1f2dd28e471b473460321e48e2a99c Mon Sep 17 00:00:00 2001
From: Luis Silva <[email protected]>
Date: Tue, 18 Nov 2025 15:12:07 +0000
Subject: [PATCH] tmp: arcv: Fix incorrect condition and logic in reorder
scheduler.
Signed-off-by: Luis Silva <[email protected]>
---
gcc/config/riscv/arcv.cc | 19 ++++++++++---------
1 file changed, 10 insertions(+), 9 deletions(-)
diff --git a/gcc/config/riscv/arcv.cc b/gcc/config/riscv/arcv.cc
index 8354d838c72..f0a39a23cb7 100644
--- a/gcc/config/riscv/arcv.cc
+++ b/gcc/config/riscv/arcv.cc
@@ -670,22 +670,23 @@ arcv_sched_reorder2 (rtx_insn **ready, int *n_readyp)
/* If all else fails, schedule a single instruction. */
if (ready && *n_readyp > 0
&& NONDEBUG_INSN_P (ready[*n_readyp - 1])
- && recog_memoized (ready[*n_readyp - 1]) >= 0
- && get_attr_type (ready[*n_readyp - 1]) != TYPE_LOAD
- && get_attr_type (ready[*n_readyp - 1]) != TYPE_STORE)
+ && recog_memoized (ready[*n_readyp - 1]) >= 0)
{
+ rtx_insn *insn = ready[*n_readyp - 1];
+ enum attr_type insn_type = get_attr_type (insn);
+
+ /* Memory operations go to pipeB if available. */
if (!sched_state.pipeB_scheduled_p
- && (get_attr_type (ready[*n_readyp - 1]) == TYPE_LOAD
- || get_attr_type (ready[*n_readyp - 1]) == TYPE_STORE))
+ && (insn_type == TYPE_LOAD || insn_type == TYPE_STORE))
{
- sched_state.alu_pipe_scheduled_p = sched_state.pipeB_scheduled_p = 1;
+ sched_state.pipeB_scheduled_p = 1;
sched_state.cached_can_issue_more = 1;
return 1;
}
- else if (get_attr_type (ready[*n_readyp - 1]) != TYPE_LOAD
- || get_attr_type (ready[*n_readyp - 1]) != TYPE_STORE)
+ /* Non-memory operations go to ALU pipe. */
+ else if (insn_type != TYPE_LOAD && insn_type != TYPE_STORE)
{
- sched_state.alu_pipe_scheduled_p = sched_state.pipeB_scheduled_p = 1;
+ sched_state.alu_pipe_scheduled_p = 1;
sched_state.cached_can_issue_more = 1;
return 1;
}
--
2.39.5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this makes sense. Can we evaluate the impact of this change on performance downstream & upstream?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got the following between unpatched and patched for downstream and upstream:
Downstream
Unpatched vs patched = +-0.15% faster
Upstream
Unpatched vs Patched = +-0.27% faster
Signed-off-by: Luis Silva <[email protected]>
MichielDerhaeg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. I think we can start splitting this up in a patch series for review.
No description provided.