[.integer64 performance regression in >=4.5.2 #176

@egillax

I'm running into an issue where code that previously ran fine has become so slow that it doesn't finish. I've traced the slowness back to the function `[.integer64` in this package.

bit64/R/integer64.R

Lines 886 to 905 in 62cd4ee

`[.integer64` <- function(x, i, ...) {
  cl <- oldClass(x)
  ret <- NextMethod()
  # Begin NA-handling from Leonardo Silvestri
  if (!missing(i)) {
    if (inherits(i, "character")) {
      na_idx <- union(which(!(i %in% names(x))), which(is.na(i)))
      if (length(na_idx))
        ret[na_idx] <- NA_integer64_
    } else {
      na_idx <- is.na(rep(TRUE, length(x))[i])
      if (any(na_idx))
        ret[na_idx] <- NA_integer64_
    }
  }
  # End NA-handling from Leonardo Silvestri
  oldClass(ret) <- cl
  remcache(ret)
  ret
}

Or more concretely to this line:

na_idx <- is.na(rep(TRUE, length(x))[i])

In my case I have a duckdb database and I'm using DBI to fetch the results of a query in batches. Here i is 100k integers and x is almost 500 million integer64s. Materializing a logical vector of length(x) on every subset seems to be the issue. Tag 4.0.5 does not have this issue, while 4.5.2 and later do. From the changelog, I think this bugfix is responsible: ""[.integer64"(x,i) can now cope with i longer than x"
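To make the cost concrete, here is a small base-R sketch of what that line computes. The names `x_len`, `probe`, and `i` are illustrative stand-ins, not taken from bit64. The point is that `rep(TRUE, length(x))` allocates one logical per element of x regardless of how short i is.

```r
# Stand-in for length(x); in the real case this is close to 5e8.
x_len <- 5L

# An index with an NA and an out-of-bounds position (7 > 5).
i <- c(2L, NA, 7L, 5L)

# This is the expensive step: one logical per element of x.
probe <- rep(TRUE, x_len)

# Indexing the probe yields NA for NA indices and for positions past the end.
na_idx <- is.na(probe[i])
print(na_idx)
```

Since R stores a logical in 4 bytes, a probe for roughly 500 million elements would take on the order of 2 GB per call, which fits the slowdown seen when fetching in batches.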

The following code should demonstrate the difference when running with tags 4.0.5 and 4.6.0-1

# remotes::install_github("r-lib/bit64@4.0.5") or remotes::install_github("r-lib/bit64@4.6.0-1")
library(bit64)
library(microbenchmark)
x <- as.integer64(rep(1, 1e8))
i <- sample(c(NA, 1:1e4, 1e7 + 1), size = 1e5, replace = TRUE)
microbenchmark(
  bit64 = x[i],
  times = 10L
)

4.0.5:

Unit: microseconds
  expr     min      lq     mean  median      uq      max neval
 bit64 672.959 894.574 916.0243 945.992 961.139 1027.361    10

4.6.0-1:

Unit: milliseconds
  expr      min       lq     mean   median       uq      max neval
 bit64 194.0301 200.1372 213.4576 204.3696 233.6707 235.2293    10

current main:

Unit: milliseconds
  expr      min       lq     mean  median       uq      max neval
 bit64 193.7276 200.4751 214.2612 205.575 232.2011 238.9536    10

Roughly a 200x to 300x slowdown (about 216x by median). I thought about replacing the offending line with

 na_idx <- is.na(i) | (i > length(x))

But unfortunately that doesn't work when i is logical and length(i) > length(x) (untested case btw).
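One possible direction, sketched here only as an assumption and not as the package's fix: take a fast path when i is numeric with every non-NA value at least 1, and fall back to the current expression for everything else (logical, zero, or negative indices). The helper name `na_idx_sketch` and its argument `n` (standing in for `length(x)`) are hypothetical.

```r
# Hypothetical helper: returns the same logical vector as
# is.na(rep(TRUE, n)[i]), but avoids the length-n allocation when
# i is a numeric index whose non-NA values are all >= 1.
na_idx_sketch <- function(i, n) {
  if (is.numeric(i) && all(i >= 1, na.rm = TRUE)) {
    # No zeros are dropped and no negatives exclude elements, so the
    # result lines up with i. trunc() mirrors how R truncates
    # fractional indices before looking them up.
    is.na(i) | trunc(i) > n
  } else {
    # Logical, zero, or negative indices: keep the existing behaviour.
    is.na(rep(TRUE, n)[i])
  }
}

# Compare against the current expression on a few index shapes.
n <- 10L
cases <- list(
  c(1L, NA, 11L, 10L),   # NA and out-of-bounds
  c(1.9, 10.5, 11),      # fractional indices
  integer(0),            # empty index
  c(0L, 2L, NA),         # contains zero, takes the fallback
  c(-1L, -2L),           # negative, takes the fallback
  c(TRUE, NA, FALSE),    # logical, recycled
  rep(TRUE, 12)          # logical longer than x
)
for (i in cases) {
  stopifnot(identical(na_idx_sketch(i, n), is.na(rep(TRUE, n)[i])))
}
```

The fallback still allocates a length-n vector, but for logical and negative indices the result is typically proportional to length(x) anyway, so the fast path targets the case reported here: a short integer index into a very long vector. Attributes such as names on i are not handled and would need attention in a real patch.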
