Skip to content

Conditions using bitwise OR on booleans may produce de-optimized code #32414

Closed
@mzabaluev

Description

@mzabaluev
Contributor

When writing optimization-friendly code, sometimes it might seem like a good idea to unroll branches by hand by performing multi-step computations optimistically, while keeping tabs on possible failures of intermediate steps as booleans. These failure flags are then combined using bitwise logic when the validity of the result is finally checked. Here's an example of how it might work with checked arithmetics, inspired by some code in std::hash:

fn calculate_size(elem_size: usize,
                  length: usize,
                  offset: usize)
                  -> Option<usize> {
    let (acc, oflo1) = elem_size.overflowing_mul(length);
    let (acc, oflo2) = acc.overflowing_add(offset);
    if oflo1 | oflo2 {
        None
    } else {
        Some(acc)
    }
}

However, in optimized code generation at least on x86-64, the bitwise OR for booleans is sometimes decomposed into a series of checks and branches, defeating the whole purpose. Here's a condensed benchmark comparing boolean OR with integer bitwise OR, where the results of both are used as the condition for branching.

#![feature(test)]

extern crate test;

use test::Bencher;

#[inline(never)]
fn or_bools(a: bool, b: bool, c: bool) -> Option<u64> {
    if a | b | c { Some(1) } else { None }
}

#[inline(never)]
fn or_bytes(a: u8, b: u8, c: u8) -> Option<u64> {
    if (a | b | c) != 0 { Some(1) } else { None }
}

#[bench]
fn bench_or_bools(b: &mut Bencher) {
    const DATA: [(bool, bool, bool); 4]
              = [(false, false, false),
                 (true , false, false),
                 (false, true , false),
                 (false, false, true)];
    b.iter(|| {
        for i in 0 .. 4 {
            let (a, b, c) = DATA[i];
            test::black_box(or_bools(a, b, c));
        }
    })
}

#[bench]
fn bench_or_bytes(b: &mut Bencher) {
    const DATA: [(u8, u8, u8); 4]
              = [(0u8, 0u8, 0u8),
                 (1u8, 0u8, 0u8),
                 (0u8, 1u8, 0u8),
                 (0u8, 0u8, 1u8)];
    b.iter(|| {
        for i in 0 .. 4 {
            let (a, b, c) = DATA[i];
            test::black_box(or_bytes(a, b, c));
        }
    })
}

The de-optimization looks like work of LLVM, as the IR for or_bools preserves the original intent:

; Function Attrs: noinline norecurse nounwind uwtable
define internal fastcc void @_ZN8or_bools20h51bacbaed15b22f4gaaE(%"2.core::option::Option<u64>"* noalias nocapture dereferenceable(16), i1 zeroext, i1 zeroext, i1 zeroext) unnamed_addr #0 {
entry-block:
  %4 = or i1 %1, %2
  %5 = or i1 %4, %3
  %6 = bitcast %"2.core::option::Option<u64>"* %0 to i8*
  br i1 %5, label %then-block-26-, label %else-block

then-block-26-:                                   ; preds = %entry-block
  tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %6, i8* nonnull bitcast ({ i64, i64, [0 x i8] }* @const5784 to i8*), i64 16, i32 8, i1 false)
  br label %join

else-block:                                       ; preds = %entry-block
  tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %6, i8* nonnull bitcast ({ i64, [8 x i8] }* @const5785 to i8*), i64 16, i32 8, i1 false)
  br label %join

join:                                             ; preds = %else-block, %then-block-26-
  ret void
}

Activity

mzabaluev

mzabaluev commented on Mar 21, 2016

@mzabaluev
ContributorAuthor

This probably should have been filed straight for LLVM, but I'm intimidated by their Bugzilla.

added a commit that references this issue on Mar 22, 2016
eefriedman

eefriedman commented on Mar 24, 2016

@eefriedman
Contributor

Long discussion on LLVM bugtracker: https://llvm.org/bugs/show_bug.cgi?id=23827 .

mzabaluev

mzabaluev commented on Mar 24, 2016

@mzabaluev
ContributorAuthor

@eefriedman, thanks. As I understood it, the relative performance of branched vs branchless code to evaluate a boolean-ish expression tree depends on the predictability of branch predicates, and branched code may be better when: 1) the branches are highly predictable; 2) earlier ones tend to be taken often, short-circuiting the evaluation. In the specific case of checked arithmetics, the branches are very predictable (no overflows occur normally), but an evaluation normally has to check all predicates. I guess it's rather a matter of hinting the compiler on the probability of branches, the support for which is yet to be implemented (#26179).

added
I-slowIssue: Problems and improvements with respect to performance of generated code.
on Mar 24, 2016
Mark-Simulacrum

Mark-Simulacrum commented on May 2, 2017

@Mark-Simulacrum
Member

The LLVM bug was closed as invalid, and unstable expect/likely intrinsic is now implemented judging by the tracking issue. I'm going to close this, please ask us to reopen if you disagree.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    I-slowIssue: Problems and improvements with respect to performance of generated code.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

      Development

      No branches or pull requests

        Participants

        @steveklabnik@mzabaluev@Mark-Simulacrum@eefriedman

        Issue actions

          Conditions using bitwise OR on booleans may produce de-optimized code · Issue #32414 · rust-lang/rust