| 1 | +--- |
| 2 | +title: "Libarrow binary features" |
| 3 | +description: > |
| 4 | + Understanding which C++ features are enabled in Arrow R package builds |
| 5 | +output: rmarkdown::html_vignette |
| 6 | +--- |
| 7 | + |
| 8 | +This document explains which C++ features are enabled in different Arrow R |
| 9 | +package build configurations, and documents the decisions behind our default |
| 10 | +feature set. It is intended as internal developer documentation and is |
| 11 | +not a guide for installing the Arrow R package; for installation |
| 12 | +instructions, see the |
| 13 | +[installation guide](../../install.html). |
| 14 | + |
| 15 | +## Overview |
| 16 | + |
| 17 | +When the Arrow R package is installed, it needs a copy of the Arrow C++ library |
| 18 | +(libarrow). This can come from: |
| 19 | + |
| 20 | +1. **Prebuilt binaries** we host (for releases and nightlies) |
| 21 | +2. **Source builds** when binaries aren't available or users opt out |
| 22 | + |
| 23 | +The features available in libarrow depend on how it was built. This document |
| 24 | +covers the feature configuration for both scenarios. |
| 25 | + |
| 26 | +## Prebuilt libarrow binary configuration |
| 27 | + |
| 28 | +We produce prebuilt libarrow binaries for macOS, Windows, and Linux. These |
| 29 | +binaries include **more features** than the default source build to provide |
| 30 | +users with a fully-featured experience out of the box. |
| 31 | + |
| 32 | +### Current binary feature set |
| 33 | + |
| 34 | +| Platform | S3 | GCS | Configured in | |
| 35 | +|----------|----|----|---------------| |
| 36 | +| macOS (ARM64, x86_64) | ON | ON | `dev/tasks/r/github.packages.yml` | |
| 37 | +| Windows | ON | ON | `ci/scripts/PKGBUILD` | |
| 38 | +| Linux (x86_64) | ON | ON | `compose.yaml` (`ubuntu-cpp-static`) | |
| 39 | + |
| 40 | +### Exceptions to our build defaults |
| 41 | + |
| 42 | +Even though GCS defaults to OFF for source builds, we explicitly enable it in |
| 43 | +our prebuilt binaries because: |
| 44 | + |
| 45 | +1. **Binary users expect features to "just work"** - they shouldn't need to |
| 46 | + rebuild from source to access cloud storage |
| 47 | +2. **Build time is not a concern** - we build binaries once in CI, not on |
| 48 | + user machines |
| 49 | +3. **Parity across platforms** - users get the same features regardless of OS |
| 50 | + |
| 51 | + |
| 52 | +## Feature configuration in source builds of libarrow |
| 53 | + |
| 54 | +Source builds are controlled by `r/inst/build_arrow_static.sh`. The key |
| 55 | +environment variable is `LIBARROW_MINIMAL`: |
| 56 | + |
| 57 | +- `LIBARROW_MINIMAL` unset: Default feature set (Parquet, Dataset, JSON, common compression ON; S3/GCS/jemalloc OFF) |
| 58 | +- `LIBARROW_MINIMAL=false`: Full feature set (adds S3, jemalloc, additional compression) |
| 59 | +- `LIBARROW_MINIMAL=true`: Truly minimal (disables Parquet, Dataset, JSON, most compression, SIMD) |
| 60 | + |
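| 61 | +### Features enabled by default |
| 62 | + |
| 63 | +These features are built in both the default and full feature sets. As noted above, `LIBARROW_MINIMAL=true` disables some of them (Parquet, Dataset, JSON, and most compression): |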
| 64 | + |
| 65 | +| Feature | CMake Flag | Notes | |
| 66 | +|---------|------------|-------| |
| 67 | +| Compute | `ARROW_COMPUTE=ON` | Core compute functions | |
| 68 | +| CSV | `ARROW_CSV=ON` | CSV reading/writing | |
| 69 | +| Filesystem | `ARROW_FILESYSTEM=ON` | Local filesystem support | |
| 70 | +| JSON | `ARROW_JSON=ON` | JSON reading | |
| 71 | +| Parquet | `ARROW_PARQUET=ON` | Parquet file format | |
| 72 | +| Dataset | `ARROW_DATASET=ON` | Multi-file datasets | |
| 73 | +| Acero | `ARROW_ACERO=ON` | Query execution engine | |
| 74 | +| Mimalloc | `ARROW_MIMALLOC=ON` | Memory allocator | |
| 75 | +| LZ4 | `ARROW_WITH_LZ4=ON` | LZ4 compression | |
| 76 | +| Snappy | `ARROW_WITH_SNAPPY=ON` | Snappy compression | |
| 77 | +| RE2 | `ARROW_WITH_RE2=ON` | Regular expressions | |
| 78 | +| UTF8Proc | `ARROW_WITH_UTF8PROC=ON` | Unicode support | |
| 79 | + |
| 80 | +### Features controlled by LIBARROW_MINIMAL |
| 81 | + |
| 82 | +When `LIBARROW_MINIMAL=false`, the following additional features are enabled |
| 83 | +(because `ARROW_DEFAULT_PARAM` is then set to `ON`): |
| 84 | + |
| 85 | +| Feature | CMake Flag | Default | |
| 86 | +|---------|------------|---------| |
| 87 | +| S3 | `ARROW_S3` | `$ARROW_DEFAULT_PARAM` | |
| 88 | +| Jemalloc | `ARROW_JEMALLOC` | `$ARROW_DEFAULT_PARAM` | |
| 89 | +| Brotli | `ARROW_WITH_BROTLI` | `$ARROW_DEFAULT_PARAM` | |
| 90 | +| BZ2 | `ARROW_WITH_BZ2` | `$ARROW_DEFAULT_PARAM` | |
| 91 | +| Zlib | `ARROW_WITH_ZLIB` | `$ARROW_DEFAULT_PARAM` | |
| 92 | +| Zstd | `ARROW_WITH_ZSTD` | `$ARROW_DEFAULT_PARAM` | |
| 93 | + |
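As a rough illustration of how the tables above fit together, the following shell sketch derives a default from `LIBARROW_MINIMAL` and lets explicitly set `ARROW_*` variables override it. This is a sketch under assumptions, not a copy of `build_arrow_static.sh`: the function name `resolve_feature_flags` and the exact conditional are hypothetical, and only three flags are shown.

```shell
#!/bin/sh
# Hypothetical sketch of the default-resolution pattern described above.
# It is NOT the real build_arrow_static.sh; names other than the
# LIBARROW_MINIMAL / ARROW_* variables are made up for illustration.
resolve_feature_flags() {
  # LIBARROW_MINIMAL=false selects the full feature set; unset or any
  # other value leaves the optional extras off.
  if [ "${LIBARROW_MINIMAL:-}" = "false" ]; then
    ARROW_DEFAULT_PARAM="ON"
  else
    ARROW_DEFAULT_PARAM="OFF"
  fi

  # An explicitly set ARROW_* variable wins over the derived default.
  s3_flag="${ARROW_S3:-$ARROW_DEFAULT_PARAM}"
  jemalloc_flag="${ARROW_JEMALLOC:-$ARROW_DEFAULT_PARAM}"

  # GCS does not follow ARROW_DEFAULT_PARAM: it stays OFF unless requested.
  gcs_flag="${ARROW_GCS:-OFF}"

  echo "-DARROW_S3=$s3_flag -DARROW_JEMALLOC=$jemalloc_flag -DARROW_GCS=$gcs_flag"
}
```

Under these assumptions, calling `resolve_feature_flags` with `LIBARROW_MINIMAL=false` and no other variables set reports S3 and jemalloc as ON and GCS as OFF, which matches the behavior described in this section.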
| 94 | +### Features that require explicit opt-in |
| 95 | + |
| 96 | +GCS (Google Cloud Storage) is **always off by default**, even when |
| 97 | +`LIBARROW_MINIMAL=false`: |
| 98 | + |
| 99 | +| Feature | CMake Flag | Default | Reason | |
| 100 | +|---------|------------|---------|--------| |
| 101 | +| GCS | `ARROW_GCS` | `OFF` | Build complexity, dependency size | |
| 102 | + |
| 103 | +To enable GCS in a source build, you must explicitly set `ARROW_GCS=ON`. |
| 104 | + |
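For example, a source build with the full feature set plus GCS might be requested by exporting both variables before installing the package. This is a minimal sketch that assumes the installation process reads these variables from the environment; the commented-out install command is illustrative only.

```shell
# Full feature set: adds S3, jemalloc, and additional compression.
export LIBARROW_MINIMAL=false
# GCS is not covered by LIBARROW_MINIMAL, so it must be set explicitly.
export ARROW_GCS=ON

# Then install the R package from source in the same shell session,
# for example (illustrative, not run here):
# Rscript -e 'install.packages("arrow", type = "source")'
```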
| 105 | +**Why is GCS off by default?** |
| 106 | + |
| 107 | +GCS was turned off by default in [#48343](https://github.com/apache/arrow/pull/48343) |
| 108 | +(December 2025) because: |
| 109 | + |
| 110 | +1. Building google-cloud-cpp is fragile and adds significant build time |
| 111 | +2. The dependency on abseil (ABSL) has caused compatibility issues |
| 112 | +3. Users who need GCS can still enable it explicitly |
| 113 | + |
| 114 | +## Configuration file locations |
| 115 | + |
| 116 | +### libarrow source build configuration |
| 117 | + |
| 118 | +The main build script that controls source builds: |
| 119 | + |
| 120 | +**`r/inst/build_arrow_static.sh`** - CMake flags and defaults |
| 121 | +([view source](https://github.com/apache/arrow/blob/main/r/inst/build_arrow_static.sh)). |
| 122 | +The environment variables to look for are `LIBARROW_MINIMAL`, `ARROW_*`, and `ARROW_DEFAULT_PARAM`. |
| 123 | + |
| 124 | +### libarrow binary build configuration |
| 125 | + |
| 126 | +Each platform has its own configuration file: |
| 127 | + |
| 128 | +| Platform | Config file | Key settings | |
| 129 | +|----------|-------------|--------------| |
| 130 | +| macOS | `dev/tasks/r/github.packages.yml` | `LIBARROW_MINIMAL=false`, `ARROW_GCS=ON` | |
| 131 | +| Windows | `ci/scripts/PKGBUILD` | `ARROW_GCS=ON`, `ARROW_S3=ON` | |
| 132 | +| Linux | `compose.yaml` (`ubuntu-cpp-static`) | `LIBARROW_MINIMAL=false`, `ARROW_GCS=ON` | |
| 133 | + |
| 134 | +## R-universe builds |
| 135 | + |
| 136 | +[R-universe](https://apache.r-universe.dev/arrow) builds the Arrow R package |
| 137 | +for users who want newer versions than CRAN. R-universe behavior varies by |
| 138 | +platform and architecture: |
| 139 | + |
| 140 | +| Platform | Architecture | Build method | Features | |
| 141 | +|----------|--------------|--------------|----------| |
| 142 | +| macOS | ARM64 | Downloads prebuilt binary | Full (S3 + GCS) | |
| 143 | +| macOS | x86_64 | Downloads prebuilt binary | Full (S3 + GCS) | |
| 144 | +| Windows | x86_64 | Downloads prebuilt binary | Full (S3 + GCS) | |
| 145 | +| Windows | ARM64 | Not supported | NA | |
| 146 | +| Linux | x86_64 | Downloads prebuilt binary | Full (S3 + GCS) | |
| 147 | +| Linux | ARM64 | Builds from source | S3 only (no GCS) | |
| 148 | + |
| 149 | +### Why Linux ARM64 builds from source |
| 150 | + |
| 151 | +We only publish prebuilt Linux binaries for the x86_64 architecture. The binary |
| 152 | +selection logic in `r/tools/nixlibs.R` (line 263) explicitly checks for this: |
| 153 | + |
| 154 | +```r |
| 155 | +if (identical(os, "darwin") || (identical(os, "linux") && identical(arch, "x86_64"))) { |
| 156 | +``` |
| 157 | +When R-universe builds on Linux ARM64 runners, no binary is available, so it |
| 158 | +falls back to building from source using `build_arrow_static.sh`. Since GCS |
| 159 | +defaults to OFF in that script, Linux ARM64 users don't get GCS support. |
| 160 | + |
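The platform check quoted above can be restated as a small decision function. The sketch below is a shell paraphrase for illustration only; `libarrow_origin` is a hypothetical name, and the real logic lives in R in `r/tools/nixlibs.R`.

```shell
#!/bin/sh
# Hypothetical paraphrase of the os/arch check quoted above.
# Prints "prebuilt" when a hosted binary would be considered,
# and "source" when the build falls back to build_arrow_static.sh.
libarrow_origin() {
  os="$1"
  arch="$2"
  if [ "$os" = "darwin" ] || { [ "$os" = "linux" ] && [ "$arch" = "x86_64" ]; }; then
    echo "prebuilt"
  else
    echo "source"
  fi
}

libarrow_origin linux aarch64   # prints "source"
```

Because the Linux branch requires `x86_64`, any other Linux architecture falls through to the source build, where GCS stays off unless `ARROW_GCS=ON` is set.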
| 161 | +### Enabling GCS for Linux ARM64 |
| 162 | + |
| 163 | +To provide full feature parity for Linux ARM64, we would need to: |
| 164 | + |
| 165 | +1. Add an ARM64 Linux build job to `dev/tasks/r/github.packages.yml` |
| 166 | +2. Update `select_binary()` in `nixlibs.R` to recognize `linux-aarch64` |
| 167 | +3. Add the artifact pattern to `dev/tasks/tasks.yml` |
| 168 | +4. Update the nightly upload workflow |
| 169 | + |
| 170 | +See [GH-36193](https://github.com/apache/arrow/issues/36193) for tracking this work. |
| 171 | + |
| 172 | +Alternatively, changing the GCS default in `build_arrow_static.sh` from `OFF` |
| 173 | +to `$ARROW_DEFAULT_PARAM` would enable GCS for all source builds, including |
| 174 | +Linux ARM64 on R-universe. |
| 175 | + |
| 176 | +## Checking installed features |
| 177 | + |
| 178 | +Users can check which features are enabled in their installation: |
| 179 | + |
| 180 | +```r |
| 181 | +# Show all capabilities |
| 182 | +arrow::arrow_info() |
| 183 | + |
| 184 | +# Check specific features |
| 185 | +arrow::arrow_with_s3() |
| 186 | +arrow::arrow_with_gcs() |
| 187 | +``` |
| 188 | + |
| 189 | +## Related documentation |
| 190 | + |
| 191 | +- [Installation guide](../install.html) - User-facing installation docs |
| 192 | +- [Installation details](./install_details.html) - How the build system works |
| 193 | +- [Developer setup](./setup.html) - Building Arrow for development |