Description
When writing optimization-friendly code, sometimes it might seem like a good idea to unroll branches by hand by performing multi-step computations optimistically, while keeping tabs on possible failures of intermediate steps as booleans. These failure flags are then combined using bitwise logic when the validity of the result is finally checked. Here's an example of how it might work with checked arithmetics, inspired by some code in std::hash
:
fn calculate_size(elem_size: usize,
length: usize,
offset: usize)
-> Option<usize> {
let (acc, oflo1) = elem_size.overflowing_mul(length);
let (acc, oflo2) = acc.overflowing_add(offset);
if oflo1 | oflo2 {
None
} else {
Some(acc)
}
}
However, in optimized code generation at least on x86-64, the bitwise OR for booleans is sometimes decomposed into a series of checks and branches, defeating the whole purpose. Here's a condensed benchmark comparing boolean OR with integer bitwise OR, where the results of both are used as the condition for branching.
#![feature(test)]
extern crate test;
use test::Bencher;
#[inline(never)]
fn or_bools(a: bool, b: bool, c: bool) -> Option<u64> {
if a | b | c { Some(1) } else { None }
}
#[inline(never)]
fn or_bytes(a: u8, b: u8, c: u8) -> Option<u64> {
if (a | b | c) != 0 { Some(1) } else { None }
}
#[bench]
fn bench_or_bools(b: &mut Bencher) {
const DATA: [(bool, bool, bool); 4]
= [(false, false, false),
(true , false, false),
(false, true , false),
(false, false, true)];
b.iter(|| {
for i in 0 .. 4 {
let (a, b, c) = DATA[i];
test::black_box(or_bools(a, b, c));
}
})
}
#[bench]
fn bench_or_bytes(b: &mut Bencher) {
const DATA: [(u8, u8, u8); 4]
= [(0u8, 0u8, 0u8),
(1u8, 0u8, 0u8),
(0u8, 1u8, 0u8),
(0u8, 0u8, 1u8)];
b.iter(|| {
for i in 0 .. 4 {
let (a, b, c) = DATA[i];
test::black_box(or_bytes(a, b, c));
}
})
}
The de-optimization looks like work of LLVM, as the IR for or_bools
preserves the original intent:
; Function Attrs: noinline norecurse nounwind uwtable
define internal fastcc void @_ZN8or_bools20h51bacbaed15b22f4gaaE(%"2.core::option::Option<u64>"* noalias nocapture dereferenceable(16), i1 zeroext, i1 zeroext, i1 zeroext) unnamed_addr #0 {
entry-block:
%4 = or i1 %1, %2
%5 = or i1 %4, %3
%6 = bitcast %"2.core::option::Option<u64>"* %0 to i8*
br i1 %5, label %then-block-26-, label %else-block
then-block-26-: ; preds = %entry-block
tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %6, i8* nonnull bitcast ({ i64, i64, [0 x i8] }* @const5784 to i8*), i64 16, i32 8, i1 false)
br label %join
else-block: ; preds = %entry-block
tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %6, i8* nonnull bitcast ({ i64, [8 x i8] }* @const5785 to i8*), i64 16, i32 8, i1 false)
br label %join
join: ; preds = %else-block, %then-block-26-
ret void
}