
[AArch64] Add support for -mlong-calls code generation #142982

Open
wants to merge 1 commit into
base: main

dongjianqiang2
Contributor

This patch implements backend support for -mlong-calls on AArch64 targets. When enabled, calls to external functions are lowered to an indirect call via an address computed using adrp and add rather than a direct bl instruction, which is limited to a ±128MB PC-relative offset.

This is particularly useful when code and/or data exceeds the 26-bit immediate range of bl, such as in large binaries or link-time-optimized builds.
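As a rough illustration of the ±128MB limit mentioned above: `bl` encodes a signed 26-bit word offset, so the reachable byte offsets are multiples of 4 in the range [-2^27, 2^27 - 4]. The sketch below is not part of the patch; `inBlRange` and `encodeBlImm26` are hypothetical helper names introduced only for this example.

```cpp
#include <cassert>
#include <cstdint>

// BL stores a signed 26-bit immediate that is multiplied by 4 at execution.
constexpr int64_t kBlMinOffset = -(int64_t{1} << 27);     // -128 MiB
constexpr int64_t kBlMaxOffset = (int64_t{1} << 27) - 4;  // +128 MiB - 4

// Hypothetical helper: can a BL located at `pc` branch directly to `target`?
constexpr bool inBlRange(uint64_t pc, uint64_t target) {
  int64_t offset = static_cast<int64_t>(target - pc);
  return offset % 4 == 0 && offset >= kBlMinOffset && offset <= kBlMaxOffset;
}

// Hypothetical helper: the imm26 field a BL would carry for this branch.
inline uint32_t encodeBlImm26(uint64_t pc, uint64_t target) {
  assert(inBlRange(pc, target) && "target is not reachable with a direct BL");
  int64_t offset = static_cast<int64_t>(target - pc);
  // Keep only the low 26 bits of the word offset (two's complement).
  return static_cast<uint32_t>(offset / 4) & 0x03FFFFFFu;
}
```

A call for which `inBlRange` is false is one that either a linker-inserted thunk or an indirect sequence has to handle.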

Key changes:

  • In SelectionDAG lowering (LowerCall), detect -mlong-calls and emit:
    • adrp + add address calculation
    • blr indirect call instruction
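For readers unfamiliar with the sequence, the following sketch models what `adrp` followed by `add ... :lo12:` computes: `adrp` adds a signed 21-bit count of 4 KiB pages to the page of the current PC (roughly ±4 GiB of reach), and `add` supplies the low 12 bits of the target. The names `AdrpAdd`, `split`, and `materialize` are hypothetical and exist only for this example; this is a model of the address arithmetic, not of the backend code.

```cpp
#include <cassert>
#include <cstdint>

// The two immediates carried by the adrp/add pair.
struct AdrpAdd {
  int64_t pageDelta;  // signed number of 4 KiB pages, encoded by adrp
  uint32_t lo12;      // low 12 bits of the target, encoded by add
};

constexpr uint64_t pageOf(uint64_t addr) { return addr & ~uint64_t{0xFFF}; }

// Split `target` into the immediates needed when the adrp executes at `pc`.
inline AdrpAdd split(uint64_t pc, uint64_t target) {
  // The byte difference between two page bases is a multiple of 4096.
  int64_t pageDelta =
      static_cast<int64_t>(pageOf(target) - pageOf(pc)) / 4096;
  // adrp has a signed 21-bit page immediate.
  assert(pageDelta >= -(int64_t{1} << 20) && pageDelta < (int64_t{1} << 20));
  return {pageDelta, static_cast<uint32_t>(target & 0xFFF)};
}

// Recompute the address the adrp/add pair would leave in the register.
inline uint64_t materialize(uint64_t pc, AdrpAdd s) {
  return pageOf(pc) + static_cast<uint64_t>(s.pageDelta) * 4096 + s.lo12;
}
```

The resulting register value is then used as the operand of `blr`.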

This patch ensures that long-calls are emitted correctly for both GlobalAddress and ExternalSymbol call targets.

Tested:

  • New codegen tests under llvm/test/CodeGen/AArch64/aarch64-long-calls.ll
  • Verified adrp + add + blr output in .s for global and external functions

@llvmbot added the clang, backend:AArch64, and clang:driver labels Jun 5, 2025
@llvmbot
Member

llvmbot commented Jun 5, 2025

@llvm/pr-subscribers-clang

@llvm/pr-subscribers-backend-aarch64

Author: dong jianqiang (dongjianqiang2)

Full diff: https://github.com/llvm/llvm-project/pull/142982.diff

4 Files Affected:

  • (modified) clang/lib/Driver/ToolChains/Arch/AArch64.cpp (+6)
  • (modified) llvm/lib/Target/AArch64/AArch64Features.td (+4)
  • (modified) llvm/lib/Target/AArch64/AArch64ISelLowering.cpp (+10-3)
  • (added) llvm/test/CodeGen/AArch64/aarch64-long-calls.ll (+26)
diff --git a/clang/lib/Driver/ToolChains/Arch/AArch64.cpp b/clang/lib/Driver/ToolChains/Arch/AArch64.cpp
index eaae9f876e3ad..2463bcdae2f4f 100644
--- a/clang/lib/Driver/ToolChains/Arch/AArch64.cpp
+++ b/clang/lib/Driver/ToolChains/Arch/AArch64.cpp
@@ -466,6 +466,12 @@ void aarch64::getAArch64TargetFeatures(const Driver &D,
 
   if (Args.getLastArg(options::OPT_mno_bti_at_return_twice))
     Features.push_back("+no-bti-at-return-twice");
+
+  if (Arg *A = Args.getLastArg(options::OPT_mlong_calls,
+                               options::OPT_mno_long_calls)) {
+    if (A->getOption().matches(options::OPT_mlong_calls))
+      Features.push_back("+long-calls");
+  }
 }
 
 void aarch64::setPAuthABIInTriple(const Driver &D, const ArgList &Args,
diff --git a/llvm/lib/Target/AArch64/AArch64Features.td b/llvm/lib/Target/AArch64/AArch64Features.td
index 469c76752c78c..5af6ed5f1ffa2 100644
--- a/llvm/lib/Target/AArch64/AArch64Features.td
+++ b/llvm/lib/Target/AArch64/AArch64Features.td
@@ -825,6 +825,10 @@ def FeatureDisableFastIncVL : SubtargetFeature<"disable-fast-inc-vl",
                                                "HasDisableFastIncVL", "true",
                                                "Do not prefer INC/DEC, ALL, { 1, 2, 4 } over ADDVL">;
 
+def FeatureLongCalls : SubtargetFeature<"long-calls", "GenLongCalls", "true",
+                                        "Generate calls via indirect call "
+                                        "instructions">;
+
 //===----------------------------------------------------------------------===//
 // Architectures.
 //
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 9f51caef6d228..d6015ccf94afc 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -9286,8 +9286,12 @@ AArch64TargetLowering::LowerCall(CallLoweringInfo &CLI,
       Callee = DAG.getTargetGlobalAddress(CalledGlobal, DL, PtrVT, 0, OpFlags);
       Callee = DAG.getNode(AArch64ISD::LOADgot, DL, PtrVT, Callee);
     } else {
-      const GlobalValue *GV = G->getGlobal();
-      Callee = DAG.getTargetGlobalAddress(GV, DL, PtrVT, 0, OpFlags);
+      if (Subtarget->genLongCalls())
+        Callee = getAddr(G, DAG, OpFlags);
+      else {
+        const GlobalValue *GV = G->getGlobal();
+        Callee = DAG.getTargetGlobalAddress(GV, DL, PtrVT, 0, OpFlags);
+      }
     }
   } else if (auto *S = dyn_cast<ExternalSymbolSDNode>(Callee)) {
     bool UseGot = (getTargetMachine().getCodeModel() == CodeModel::Large &&
@@ -9298,7 +9302,10 @@ AArch64TargetLowering::LowerCall(CallLoweringInfo &CLI,
       Callee = DAG.getTargetExternalSymbol(Sym, PtrVT, AArch64II::MO_GOT);
       Callee = DAG.getNode(AArch64ISD::LOADgot, DL, PtrVT, Callee);
     } else {
-      Callee = DAG.getTargetExternalSymbol(Sym, PtrVT, 0);
+      if (Subtarget->genLongCalls())
+        Callee = getAddr(S, DAG, 0);
+      else
+        Callee = DAG.getTargetExternalSymbol(Sym, PtrVT, 0);
     }
   }
 
diff --git a/llvm/test/CodeGen/AArch64/aarch64-long-calls.ll b/llvm/test/CodeGen/AArch64/aarch64-long-calls.ll
new file mode 100644
index 0000000000000..cb41c3cf519e0
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/aarch64-long-calls.ll
@@ -0,0 +1,26 @@
+; RUN: llc -O2 -mtriple=aarch64-linux-gnu -mcpu=generic -mattr=+long-calls < %s | FileCheck %s
+
+declare void @far_func()
+declare void @llvm.memset.p0.i64(ptr nocapture writeonly, i8, i64, i1 immarg)
+
+define void @test() {
+entry:
+  call void @far_func()
+  ret void
+}
+
+define void @test2(ptr %dst, i8 %val, i64 %len) {
+entry:
+  call void @llvm.memset.p0.i64(ptr %dst, i8 %val, i64 %len, i1 false)
+  ret void
+}
+
+; CHECK-LABEL: test:
+; CHECK: adrp {{x[0-9]+}}, far_func
+; CHECK: add {{x[0-9]+}}, {{x[0-9]+}}, :lo12:far_func
+; CHECK: blr {{x[0-9]+}}
+
+; CHECK-LABEL: test2:
+; CHECK: adrp {{x[0-9]+}}, memset
+; CHECK: add {{x[0-9]+}}, {{x[0-9]+}}, :lo12:memset
+; CHECK: blr {{x[0-9]+}}
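
The driver hunk in the diff above uses `getLastArg` on the `-mlong-calls` / `-mno-long-calls` pair, so whichever of the two flags appears last decides, and `+long-calls` is pushed only when that last flag is `-mlong-calls`. A minimal standalone sketch of that behaviour, assuming plain strings instead of the real `ArgList` machinery (the function name `longCallFeatures` is hypothetical):

```cpp
#include <string>
#include <vector>

// Simplified stand-in for the driver logic: scan the arguments in order so
// that the last of -mlong-calls / -mno-long-calls determines the result.
std::vector<std::string> longCallFeatures(const std::vector<std::string> &args) {
  bool enabled = false;  // the feature is off when neither flag is given
  for (const std::string &arg : args) {
    if (arg == "-mlong-calls")
      enabled = true;
    else if (arg == "-mno-long-calls")
      enabled = false;
  }
  std::vector<std::string> features;
  if (enabled)
    features.push_back("+long-calls");
  return features;
}
```

As in the patch, `-mno-long-calls` produces no feature string at all; it only cancels an earlier `-mlong-calls`.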

@llvmbot
Member

llvmbot commented Jun 5, 2025

@llvm/pr-subscribers-clang-driver

Author: dong jianqiang (dongjianqiang2)

@hstk30-hw requested review from rj-jesus, sjoerdmeijer, MaskRay and jmolloy and removed request for rj-jesus June 5, 2025 14:53
@smithp35
Collaborator

smithp35 commented Jun 5, 2025

My understanding is that this will turn all calls to global functions into long calls.

On AArch64, static linkers are required to insert range-extension thunks for out-of-range BLs. In the best case the thunk is just another direct branch; in the worst case, with --pic-veneer, it is just adrp, add, br. I would expect on-demand, linker-inserted thunks to outperform making all calls long for the majority of programs. I'm interested in any data showing that long calls work better, and whether that could feed back into the lld thunk generation code. For example, are the thunks too far away from the caller, causing page faults, etc.?

I note that with -ffunction-sections and certain linker options, calls to static functions can go out of range too. These would get handled by linker thunks, though.
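
To make the trade-off in this comment concrete, here is a small sketch that counts how many call sites pay extra under each strategy: with on-demand thunks only the out-of-range calls are affected, whereas with long calls every call site becomes an indirect sequence. It is a deliberately crude model under stated assumptions: it ignores thunk sharing and placement, which real linkers handle, and `CallSite`, `needsRangeThunk`, and `countOverhead` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One call in the final image: address of the BL and address of the callee.
struct CallSite {
  uint64_t pc;
  uint64_t target;
};

// True when a direct BL cannot reach the target, i.e. the byte offset lies
// outside [-128 MiB, +128 MiB - 4].
bool needsRangeThunk(const CallSite &call) {
  int64_t offset = static_cast<int64_t>(call.target - call.pc);
  return offset < -(int64_t{1} << 27) || offset > (int64_t{1} << 27) - 4;
}

struct Overhead {
  std::size_t thunkedCalls;   // calls routed through a linker thunk
  std::size_t longCallSites;  // calls expanded when everything is a long call
};

// Compare the two strategies by the number of affected call sites.
Overhead countOverhead(const std::vector<CallSite> &calls) {
  std::size_t thunked = 0;
  for (const CallSite &call : calls)
    if (needsRangeThunk(call))
      ++thunked;
  return {thunked, calls.size()};
}
```

In an image where most callees are nearby, `thunkedCalls` stays small while `longCallSites` always equals the total number of calls.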

@dongjianqiang2
Contributor Author

This option is explicitly designed to enable reliable patching workflows when compiling object files. It is meant to guarantee call-range safety in patches. When individual object files are modified or recompiled (e.g., for security patches), the final memory layout is unknown at compile time, so patched functions might end up more than 128MB away from their callers. -mlong-calls forces all cross-object calls to use 64-bit absolute addressing.

@MaskRay
Member

MaskRay commented Jun 5, 2025

-mlong-calls is an old-fashioned compiler option. I think it was added before linkers knew about range extension thunks (aka stubs, veneers, etc).

Can you use -fno-plt instead? It works with both SelectionDAG and GlobalISel. You will get a GOT-generating code sequence that the linker can optimize to adrp+add.
You can use --emit-relocs to get relocations in the executable.
We could implement __attribute__((noplt)) if you want the patching to be per-function.

The proposed -mlong-calls is a -fno-pic hack that works only in limited scenarios and has a large performance downside. I don't think we should support it.

@smithp35
Collaborator

smithp35 commented Jun 5, 2025

> This option is explicitly designed to enable reliable patching workflows when compiling object files. It is meant to guarantee call-range safety in patches. When individual object files are modified or recompiled (e.g., for security patches), the final memory layout is unknown at compile time, so patched functions might end up more than 128MB away from their callers. -mlong-calls forces all cross-object calls to use 64-bit absolute addressing.

If I've understood object patching correctly, this would mean inserting a new function implementation and binary-patching all the call sites to point to the new implementation.

As an aside to this patch:

I'd be tempted to see if I could indirect all the calls via the PLT. Then you'd be able to add the new function and alter the dynamic symbol table entry to point to the new implementation, and the dynamic linker would do the rest. That might need some fiddling in the linker or compiler driver to force it to create a PLT entry; --shared would do it, but for an executable we'd need a PT_INTERP segment.

There was a Discourse thread on ROM Patching for embedded systems https://discourse.llvm.org/t/rfc-a-user-guided-rom-patching-mechanism-for-embedded-applications/78467 which had a similar idea.

@dongjianqiang2
Contributor Author

> -mlong-calls is an old-fashioned compiler option. I think it was added before linkers knew about range extension thunks (aka stubs, veneers, etc).
>
> Can you use -fno-plt instead? It works with both SelectionDAG and GlobalISel. You will get a GOT-generating code sequence that the linker can optimize to adrp+add. You can use --emit-relocs to get relocations in the executable. We could implement __attribute__((noplt)) if you want the patching to be per-function.
>
> The proposed -mlong-calls is a -fno-pic hack that works only in limited scenarios and has a large performance downside. I don't think we should support it.

Yes, we are indeed still using the -mlong-calls option in our older embedded systems. This is necessary due to the lack of support for GOT-based relocation types in these environments. As a result, we have incorporated this option to ensure compatibility and functionality.

Going forward, it is important to add support for these scenarios in SelectionDAG and GlobalISel.


github-actions bot commented Jun 13, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@dongjianqiang2
Contributor Author

Thanks @smithp35 for your solution! I would like to kindly ask for your expertise in reviewing the following code, which implements backend support for -mlong-calls on AArch64 targets. It might not need to be merged; please just consider it an optional approach.
Thank you once again for your time and consideration.

@smithp35
Collaborator

> Thanks @smithp35 for your solution! I would like to kindly ask for your expertise in reviewing the following code, which implements backend support for -mlong-calls on AArch64 targets. It might not need to be merged; please just consider it an optional approach.
> Thank you once again for your time and consideration.

I'm mostly a linker/ABI person, so I'm not much of an expert in code generation; if I can find some time, I can check whether I can spot any obvious mistakes. The thing I'd want to check is that the rest of the backend has recorded these additional indirect calls. I'm thinking in particular of BTI, which the compiler can sometimes omit when it can show there are no indirect calls to a symbol.

I can't help with getting this merged, as that is a maintainers' call. Even if the code is correct today, it will need to be maintained, and future changes/transformations will need to make sure it doesn't break. The maintainers have to decide whether the use case is worth it for a wide variety of use cases or whether it should be a downstream change.

> Yes, we are indeed still using the -mlong-calls option in our older embedded systems. This is necessary due to the lack of support for GOT-based relocation types in these environments. As a result, we have incorporated this option to ensure compatibility and functionality.

The two linkers I'm most familiar with, lld and Arm's proprietary linker armlink, will statically resolve the GOT relocations when doing a static link; I would expect GNU ld to do this too. LLD will even transform the GOT access into a PC-relative one when the definition is local: https://github.com/ARM-software/abi-aa/blob/main/aaelf64/aaelf64.rst#579relocation-optimization

4 participants