Upstream ARC-V RHX-100 #176
Draft: MichielDerhaeg wants to merge 17 commits into michiel/upstream_rmx100 from michiel/upstream_rhx100
Signed-off-by: Claudiu Zissulescu <[email protected]>
For the RMX-500 and RHX cores, the sequence "load-immediate + store" (used to store a constant value) can be executed in 1 cycle, provided the two instructions are kept next to one another. This patch handles this case in riscv_macro_fusion_pair_p().
Signed-off-by: Artemiy Volkov <[email protected]>
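To illustrate the kind of check described above, here is a minimal sketch over a simplified instruction record. It is not the code in riscv_macro_fusion_pair_p (), which operates on RTL; the `toy_*` names and the struct layout are hypothetical, and the requirement that the store consumes the register written by the load-immediate is an assumption drawn from "used to store a constant value".

```c
#include <stdbool.h>

/* Hypothetical, simplified instruction record; GCC itself works on RTL insns. */
enum toy_op { TOY_LI, TOY_STORE, TOY_OTHER };

struct toy_insn {
  enum toy_op op;
  int rd;   /* destination register (li), -1 if none */
  int rs2;  /* register holding the value to store, -1 if none */
};

/* True when PREV loads an immediate that CURR, the next insn, stores.
   Assumption: the store must read the register the li just wrote. */
static bool toy_li_store_pair_p(const struct toy_insn *prev,
                                const struct toy_insn *curr)
{
  return prev->op == TOY_LI
         && curr->op == TOY_STORE
         && prev->rd == curr->rs2;
}
```

The predicate is directional on purpose: the constant has to be produced before it can be stored.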
ARC-V related optimisations must be guarded like:
if (riscv_microarchitecture == <arch>) {
...
}
Introduce an inline function that encapsulates this:
static inline bool riscv_is_micro_arch (<arch>)
Use it to define __riscv_rhx whenever compiling for the RHX
microarchitecture.
Signed-off-by: Shahab Vahedi <[email protected]>
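As a rough sketch of the helper described above: the enum, its members, and the `toy_current_microarchitecture` variable below are hypothetical stand-ins for the back end's own tuning state, and only the shape of the comparison is taken from the commit message.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the back end's microarchitecture enum and
   for the global that records the selected tuning. */
enum toy_microarchitecture { TOY_GENERIC, TOY_RMX100, TOY_RHX };

static enum toy_microarchitecture toy_current_microarchitecture = TOY_GENERIC;

/* Wraps the "if (riscv_microarchitecture == <arch>)" guard in one place. */
static inline bool toy_is_micro_arch(enum toy_microarchitecture arch)
{
  return toy_current_microarchitecture == arch;
}
```

On the user side, code compiled for this tuning could then test `#ifdef __riscv_rhx`, since the commit defines that macro when targeting the RHX microarchitecture.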
With this commit, we allow a load-immediate to be macro-op fused with a successive conditional branch that is dependent on it, e.g.:
li t0, #imm
bge t1, t0, .label
Additionally, add a new testcase to check that this fusion type is handled correctly.
Signed-off-by: Artemiy Volkov <[email protected]>
To take better advantage of double load/store fusion, make use of the sched_fusion pass that assigns unique "fusion priorities" to load/store instructions and schedules operations on adjacent addresses together. This maximizes the probability that loads/stores are fused between each other instead of with other instructions.
Signed-off-by: Artemiy Volkov <[email protected]>
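The effect of grouping operations on adjacent addresses can be pictured with a plain sort. This is only an analogy: the real pass assigns priorities to insns through the scheduler rather than calling qsort, and the `toy_memop` record and both functions below are hypothetical.

```c
#include <stdlib.h>

/* Hypothetical view of a memory operation: base register plus offset. */
struct toy_memop { int base_reg; long offset; };

/* Order by base register first, then by offset, so that accesses to
   neighbouring addresses end up next to each other. */
static int toy_fusion_cmp(const void *pa, const void *pb)
{
  const struct toy_memop *a = pa, *b = pb;
  if (a->base_reg != b->base_reg)
    return (a->base_reg > b->base_reg) - (a->base_reg < b->base_reg);
  return (a->offset > b->offset) - (a->offset < b->offset);
}

static void toy_sort_for_fusion(struct toy_memop *ops, size_t n)
{
  qsort(ops, n, sizeof ops[0], toy_fusion_cmp);
}
```

After sorting, candidates for double load/store fusion sit in consecutive slots, which is the property the commit relies on sched_fusion to provide.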
With this patch, arcv_macro_fusion_pair_p () recognizes instruction pairs like:
LOAD rd1, [rs1,offset]
add/sub rd2, rs1, rs2/imm
(where all regs are distinct) and:
STORE rs2, [rs1,offset]
add/sub rd, rs1, rs2/imm
as fused macro-op pairs. In the case of a load, rd1 being equal to rd2, rs1, or rs2 would lead to data hazards, hence this is disallowed; for stores, rs1 and rs2 of the two instructions must match.
Signed-off-by: Artemiy Volkov <[email protected]>
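The register constraints can be sketched as a predicate over a flattened instruction record. This encodes only the rules spelled out in the message; the `toy_*` names are hypothetical, -1 marks an unused field or an immediate operand, and accepting an immediate second operand for the store case without further checks is an assumption.

```c
#include <stdbool.h>

enum toy_kind { TOY_LOAD, TOY_STORE, TOY_ARITH };

/* Hypothetical flattened view of an instruction. Unused fields are -1. */
struct toy_insn {
  enum toy_kind kind;
  int rd;   /* destination (load, arith) */
  int rs1;  /* base address (load, store) or first source (arith) */
  int rs2;  /* stored value (store) or second source (arith); -1 if immediate */
};

static bool toy_memop_arith_pair_p(const struct toy_insn *mem,
                                   const struct toy_insn *arith)
{
  if (arith->kind != TOY_ARITH)
    return false;

  if (mem->kind == TOY_LOAD)
    /* The loaded register must not be read or written by the add/sub. */
    return mem->rd != arith->rd
           && mem->rd != arith->rs1
           && mem->rd != arith->rs2;

  if (mem->kind == TOY_STORE)
    /* Source registers of the two instructions must match. */
    return mem->rs1 == arith->rs1
           && (arith->rs2 == -1 || mem->rs2 == arith->rs2);

  return false;
}
```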
Fuse together instruction pairs such as:
LOAD rd1, [rs1,offset]
lui rd2, imm
(where rd1 and rd2 are distinct) and:
STORE rs2, [rs1,offset]
lui rd, imm
Signed-off-by: Artemiy Volkov <[email protected]>
The RHX core executes integer multiply-add sequences of the form:
mul r1,r2,r3
add r1,r1,r4
in 1 cycle due to macro-op fusion. This patch adds a
define_insn_and_split to recognize the above sequence and preserve it as
a single insn up until the post-reload split pass.
Since, due to a microarchitectural restriction, the output operand of
both instructions must be the same register, the insn_and_split pattern
has two alternatives corresponding to the following cases: (0) r1 is
different from r4, in which case the insn can be split to the sequence
above; (1) r1 and r4 are the same, in which case a temporary register
has to be used and there is no fusion. Alternative (1) is discouraged
so that reload maximizes the number of instances where MAC fusion can be
applied. Since RHX is an rv32im core, the pattern requires that the
target is 32-bit and supports multiplication.
In addition, the {u,}maddhisi3 expand is implemented for RHX to
convert the (16-bit x 16-bit + 32-bit) WIDEN_MULT_PLUS_EXPR GIMPLE
operator to the aforementioned madd_split instruction directly.
Lastly, a very basic testcase is introduced to make sure that the
new patterns are sufficient to produce MAC-fusion-aware code.
No new dejagnu failures with RUNTESTFLAGS="CFLAGS_FOR_TARGET=-mtune=rhx
dg.exp".
Signed-off-by: Artemiy Volkov <[email protected]>
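For reference, source code of roughly the following shape is what one would expect to exercise these patterns. This is an illustrative sketch, not the testcase from the commit; the function names are hypothetical, and whether a given compiler build actually emits the fused mul + add pair for them is not something this snippet verifies.

```c
#include <stdint.h>

/* 32-bit multiply-accumulate: the shape the mul + add pair implements.
   Unsigned arithmetic is used so that wraparound is well defined. */
static int32_t toy_mac32(int32_t acc, int32_t a, int32_t b)
{
  return (int32_t)((uint32_t)a * (uint32_t)b + (uint32_t)acc);
}

/* 16-bit x 16-bit + 32-bit: the widening form that the {u,}maddhisi3
   expander is described as covering. */
static int32_t toy_mac16(int32_t acc, int16_t a, int16_t b)
{
  return acc + (int32_t)a * (int32_t)b;
}
```

In the terms of the commit message, alternative (0) corresponds to the accumulator input living in a register other than the result register.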
To make sure that the multiply-add pairs (split post-reload from the madd_split instruction) are not broken up by the sched2 pass, designate them as fusable in arcv_macro_fusion_pair_p ().
Signed-off-by: Artemiy Volkov <[email protected]>
The bitfield zero_extract operation is normally expanded into an srai followed by an andi. (With the ZBS extension enabled, the special case of 1-bit zero-extract is implemented with the bexti insn.) However, since the RHX core can execute a shift-left and a shift-right of the same register in 1 cycle, we would prefer to emit those two instructions instead, and schedule them together so that macro fusion can take place. The required steps to achieve this are:
(1) Create an insn_and_split that handles the zero_extract RTX;
(2) Tell the combiner to use that split by lowering the cost of the zero_extract RTX when the target is the RHX core;
(3) Designate the resulting slli + srli pair as fusable by the scheduler.
Attached is a small testcase demonstrating the split, and that the bexti insn still takes priority over the shift pair.
Signed-off-by: Artemiy Volkov <[email protected]>
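The equivalence between the two expansions can be checked in plain C. The sketch below assumes a 32-bit register, a field of 1 to 31 bits, and pos + len <= 32; the function names are hypothetical and the shift amounts are the standard ones for a left-then-right extraction, not values copied from the patch.

```c
#include <stdint.h>

/* Extract LEN bits starting at bit POS with a right shift and a mask
   (the shift + andi shape). Requires 1 <= len <= 31, pos + len <= 32. */
static uint32_t toy_extract_mask(uint32_t x, unsigned pos, unsigned len)
{
  return (x >> pos) & ((UINT32_C(1) << len) - 1);
}

/* Same field via a shift-left that discards the bits above the field,
   then a logical shift-right (the slli + srli shape). */
static uint32_t toy_extract_shifts(uint32_t x, unsigned pos, unsigned len)
{
  return (x << (32 - pos - len)) >> (32 - len);
}
```

Both shifts operate on the same register, which is the property the fusion relies on.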
Some fusion types (namely, LD/ST-OP/OPIMM and LD/ST-LUI) are available regardless of the order of instructions. To support this, extract the new arcv_memop_arith_pair_p () and arcv_memop_lui_pair_p () functions and call them twice.
Signed-off-by: Artemiy Volkov <[email protected]>
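"Call them twice" amounts to trying a directional predicate with the operands in both orders. A minimal sketch for the LD/ST-LUI case follows; the struct and the `toy_*` functions are hypothetical and stand in for the RTL-based helpers named above.

```c
#include <stdbool.h>

enum toy_kind { TOY_LOAD, TOY_STORE, TOY_LUI, TOY_OTHER };

struct toy_insn {
  enum toy_kind kind;
  int rd;  /* destination register, -1 for stores */
};

/* Directional check: MEM is the load/store, LUI is the lui. */
static bool toy_memop_lui_pair_p(const struct toy_insn *mem,
                                 const struct toy_insn *lui)
{
  if (lui->kind != TOY_LUI)
    return false;
  if (mem->kind == TOY_LOAD)
    return mem->rd != lui->rd;  /* destinations must be distinct */
  return mem->kind == TOY_STORE;
}

/* Order-independent wrapper: try the pair both ways round. */
static bool toy_fusible_either_order_p(const struct toy_insn *prev,
                                       const struct toy_insn *curr)
{
  return toy_memop_lui_pair_p(prev, curr)
         || toy_memop_lui_pair_p(curr, prev);
}
```

Keeping the directional predicate separate means the register checks are written once and the wrapper only handles the ordering.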
This commit implements the scheduling model for the RHX-100 core. Notable
changes are:
(1) The arcv_macro_fusion_pair_p () hook has been modified to not create
SCHED_GROUPs larger than 2 instructions; also, it gives priority to double
load/store fusion, suppressing the other types until sched2.
(2) riscv_issue_rate () is set to 4 and the system is modeled as 4 separate
pipelines, giving access to as many instructions in ready_list as possible.
(3) The rhx.md description puts some initial constraints in place (e.g. memory
ops can only go into pipe B), saving some work in the reordering hook.
(4) The riscv_sched_variable_issue () and riscv_sched_reorder2 () hooks work
together to make sure (in order of descending priority) that:
(a) the critical path and the instruction priorities are respected;
(b) both pipes are filled (taking advantage of parallel dispatch within the
microarchitectural constraints);
(c) there is as much fusion going on as possible (and the existing fusion
pairs are not broken up).
There is probably some room for improvement, and some tweaks will probably have
to be made in response to HLA changes as the HW development process goes on.
Signed-off-by: Artemiy Volkov <[email protected]>
This patch implements riscv_sched_adjust_priority () for the RHX-100 microarchitecture by slightly bumping the priority of load/store pairs. As a consequence of this change, it becomes easier for riscv_sched_reorder2 () to schedule instructions in the memory pipe.
Signed-off-by: Artemiy Volkov <[email protected]>
In addition to the LW+LW and SW+SW pairs that are already being recognized as macro-op-fusable, add support for 8-bit and naturally aligned 16-bit loads operating on adjacent memory locations. To that end, introduce the new microarch-specific pair_fusion_mode_allowed_p () predicate, and call it from fusion_load_store () during sched_fusion, and from arcv_macro_fusion_pair_p () during regular scheduling passes.
Signed-off-by: Artemiy Volkov <[email protected]>
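A size-and-adjacency check of this kind might look as follows. This is a sketch under several assumptions: both accesses use the same base register, the base is itself aligned so that offset parity decides 16-bit alignment, 32-bit pairs carry no extra alignment rule here, and either access order is accepted. The `toy_*` functions are hypothetical and do not mirror the signature of pair_fusion_mode_allowed_p ().

```c
#include <stdbool.h>

/* Is an access of SIZE bytes at OFFSET eligible to start a fused pair? */
static bool toy_pair_size_allowed_p(int size, long offset)
{
  switch (size) {
  case 1:  return true;             /* 8-bit: no alignment requirement */
  case 2:  return offset % 2 == 0;  /* 16-bit: naturally aligned only */
  case 4:  return true;             /* 32-bit pairs were already handled */
  default: return false;
  }
}

/* Two same-sized accesses off one base register touch adjacent memory
   when their offsets differ by exactly the access size. */
static bool toy_adjacent_pair_p(int size, long off1, long off2)
{
  long lo = off1 < off2 ? off1 : off2;
  long hi = off1 < off2 ? off2 : off1;
  return toy_pair_size_allowed_p(size, lo) && hi - lo == size;
}
```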
Currently on ARC-V, the maddhisi3 pattern always expands to the madd_split_fused instruction regardless of the target word size, which leads to the full-width mul and add instructions being emitted for 32-bit data even on riscv64:
mul a6,a4,s6
add a6,a6,s7
sext.w s7,a6
To fix this, add another define_insn pattern (madd_split_fused_extended) wrapping the result of a MAC operation into a sign-extension from 32 to 64 bits, and use it in the (u)maddhisi3 expander in case of a 64-bit target. The assembly code after this change is more efficient, viz.:
mulw a6,a4,s6
addw a6,a6,s7
Signed-off-by: Artemiy Volkov <[email protected]>
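The two instruction sequences compute the same value because the low 32 bits of a sum or product depend only on the low 32 bits of the inputs. The sketch below models both sequences in portable C to make that concrete; the function names are hypothetical, and the narrowing casts rely on the usual two's-complement conversion behaviour (as GCC provides).

```c
#include <stdint.h>

/* Sign-extend the low 32 bits of V, as sext.w does on RV64. */
static int64_t toy_sext_w(uint64_t v)
{
  return (int64_t)(int32_t)(uint32_t)v;
}

/* Before: full-width mul and add, followed by an explicit sext.w. */
static int64_t toy_mac_full_then_sext(int64_t acc, int64_t a, int64_t b)
{
  return toy_sext_w((uint64_t)a * (uint64_t)b + (uint64_t)acc);
}

/* After: mulw and addw, each of which sign-extends its 32-bit result. */
static int64_t toy_mac_word_ops(int64_t acc, int64_t a, int64_t b)
{
  int64_t prod = toy_sext_w((uint64_t)a * (uint64_t)b);  /* mulw */
  return toy_sext_w((uint64_t)prod + (uint64_t)acc);     /* addw */
}
```

Since the word-sized forms already leave a sign-extended result in the register, the separate sext.w becomes redundant.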
This define_insn_and_split prevents *zero_extract_fused from being selected. Update the test as well: it previously passed even though the fused pattern was not selected, because the expected instructions were still produced.
Signed-off-by: Michiel Derhaeg <[email protected]>
Thanks for taking the time to contribute to GCC! Please be advised that if you are viewing this on github.com, the mirror there is unofficial and unmonitored. The GCC community does not use github.com for their contributions. Instead, we use a mailing list ([email protected]) for code submissions, code reviews, and bug reports. Please send patches there instead.