Description
When writing optimization-friendly code, it sometimes seems like a good idea to unroll branches by hand: perform a multi-step computation optimistically while recording possible failures of the intermediate steps as booleans, then combine those failure flags with bitwise logic when the validity of the result is finally checked. Here's an example of how it might work with checked arithmetic, inspired by some code in std::hash:
fn calculate_size(elem_size: usize,
                  length: usize,
                  offset: usize)
                  -> Option<usize> {
    let (acc, oflo1) = elem_size.overflowing_mul(length);
    let (acc, oflo2) = acc.overflowing_add(offset);
    if oflo1 | oflo2 {
        None
    } else {
        Some(acc)
    }
}
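For contrast, here is roughly what the same computation looks like when written with the checked_* methods instead of accumulating overflow flags. This is only a comparison sketch (the name calculate_size_checked is mine), not the actual std::hash code:

fn calculate_size_checked(elem_size: usize,
                          length: usize,
                          offset: usize)
                          -> Option<usize> {
    // Each step branches on overflow as soon as it happens,
    // instead of collecting flags for a single check at the end.
    elem_size.checked_mul(length)
             .and_then(|acc| acc.checked_add(offset))
}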
However, in optimized code generation, at least on x86-64, the bitwise OR of booleans is sometimes decomposed back into a series of checks and branches, defeating the whole purpose. Here's a condensed benchmark comparing boolean OR with integer bitwise OR, where the result of each is used as the condition for a branch.
#![feature(test)]
extern crate test;

use test::Bencher;

#[inline(never)]
fn or_bools(a: bool, b: bool, c: bool) -> Option<u64> {
    if a | b | c { Some(1) } else { None }
}

#[inline(never)]
fn or_bytes(a: u8, b: u8, c: u8) -> Option<u64> {
    if (a | b | c) != 0 { Some(1) } else { None }
}

#[bench]
fn bench_or_bools(b: &mut Bencher) {
    const DATA: [(bool, bool, bool); 4]
        = [(false, false, false),
           (true , false, false),
           (false, true , false),
           (false, false, true )];
    b.iter(|| {
        for i in 0 .. 4 {
            let (a, b, c) = DATA[i];
            test::black_box(or_bools(a, b, c));
        }
    })
}

#[bench]
fn bench_or_bytes(b: &mut Bencher) {
    const DATA: [(u8, u8, u8); 4]
        = [(0u8, 0u8, 0u8),
           (1u8, 0u8, 0u8),
           (0u8, 1u8, 0u8),
           (0u8, 0u8, 1u8)];
    b.iter(|| {
        for i in 0 .. 4 {
            let (a, b, c) = DATA[i];
            test::black_box(or_bytes(a, b, c));
        }
    })
}
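If the integer OR indeed stays branchless, the original function could be rewritten along the same lines by widening the overflow flags before combining them. This is a hypothetical sketch (the cast-based workaround and the name calculate_size_bytes are illustrative, not from std::hash), and whether it actually avoids the extra branches would have to be verified against the generated assembly:

fn calculate_size_bytes(elem_size: usize,
                        length: usize,
                        offset: usize)
                        -> Option<usize> {
    let (acc, oflo1) = elem_size.overflowing_mul(length);
    let (acc, oflo2) = acc.overflowing_add(offset);
    // Widen the flags to u8 so the OR operates on integers, mirroring or_bytes above.
    if (oflo1 as u8 | oflo2 as u8) != 0 {
        None
    } else {
        Some(acc)
    }
}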
The de-optimization looks like the work of LLVM, as the IR for or_bools preserves the original intent:
; Function Attrs: noinline norecurse nounwind uwtable
define internal fastcc void @_ZN8or_bools20h51bacbaed15b22f4gaaE(%"2.core::option::Option<u64>"* noalias nocapture dereferenceable(16), i1 zeroext, i1 zeroext, i1 zeroext) unnamed_addr #0 {
entry-block:
  %4 = or i1 %1, %2
  %5 = or i1 %4, %3
  %6 = bitcast %"2.core::option::Option<u64>"* %0 to i8*
  br i1 %5, label %then-block-26-, label %else-block

then-block-26-:                                   ; preds = %entry-block
  tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %6, i8* nonnull bitcast ({ i64, i64, [0 x i8] }* @const5784 to i8*), i64 16, i32 8, i1 false)
  br label %join

else-block:                                       ; preds = %entry-block
  tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %6, i8* nonnull bitcast ({ i64, [8 x i8] }* @const5785 to i8*), i64 16, i32 8, i1 false)
  br label %join

join:                                             ; preds = %else-block, %then-block-26-
  ret void
}
Activity
mzabaluev commented on Mar 21, 2016
This probably should have been filed straight for LLVM, but I'm intimidated by their Bugzilla.
eefriedman commented on Mar 24, 2016
Long discussion on LLVM bugtracker: https://llvm.org/bugs/show_bug.cgi?id=23827 .
mzabaluev commented on Mar 24, 2016
@eefriedman, thanks. As I understood it, the relative performance of branched vs. branchless code for evaluating a boolean-ish expression tree depends on the predictability of the branch predicates, and branched code may be better when: 1) the branches are highly predictable; 2) the earlier ones tend to be taken often, short-circuiting the evaluation. In the specific case of checked arithmetic, the branches are very predictable (no overflows occur normally), but an evaluation normally has to check all of the predicates. I guess it's rather a matter of hinting the compiler about the probability of branches, support for which is yet to be implemented (#26179).
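For illustration, the kind of hint I have in mind might look something like this with the unstable likely/unlikely intrinsics (nightly-only; the name calculate_size_hinted and the exact spelling of the intrinsic call are just assumptions on my part):

#![feature(core_intrinsics)]

fn calculate_size_hinted(elem_size: usize,
                         length: usize,
                         offset: usize)
                         -> Option<usize> {
    let (acc, oflo1) = elem_size.overflowing_mul(length);
    let (acc, oflo2) = acc.overflowing_add(offset);
    // Tell the optimizer that the overflow case is the cold path.
    if unsafe { std::intrinsics::unlikely(oflo1 | oflo2) } {
        None
    } else {
        Some(acc)
    }
}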
Mark-Simulacrum commented on May 2, 2017
The LLVM bug was closed as invalid, and the unstable expect/likely intrinsics are now implemented, judging by the tracking issue. I'm going to close this; please ask us to reopen if you disagree.