Commit a515a4d

cbb330raulcdpitroujmao-denverboneanxs
authored
Task #0: Add ORC column statistics APIs (#2)
* GH-48965: [Python][C++] Compare unique_ptr for CFlightResult or CFlightInfo to nullptr instead of NULL (#48968)

### Rationale for this change

Cython built code is currently failing to compile on free threaded wheels due to:

```
/arrow/python/build/temp.linux-x86_64-cpython-313t/_flight.cpp: In function ‘PyObject* __pyx_gb_7pyarrow_7_flight_12FlightClient_9do_action_2generator2(__pyx_CoroutineObject*, PyThreadState*, PyObject*)’:
/arrow/python/build/temp.linux-x86_64-cpython-313t/_flight.cpp:43068:110: error: call of overloaded ‘unique_ptr(NULL)’ is ambiguous
43068 |   __pyx_t_3 = (__pyx_cur_scope->__pyx_v_result->result == ((std::unique_ptr< arrow::flight::Result> )NULL));
      |
```

### What changes are included in this PR?

Update comparing `unique_ptr[CFlightResult]` and `unique_ptr[CFlightInfo]` from `NULL` to `nullptr`.

### Are these changes tested?

Yes via archery.

### Are there any user-facing changes?

No

* GitHub Issue: #48965

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-48924: [C++][CI] Fix pre-buffering issues in IPC file reader (#48925)

### What changes are included in this PR?

Bug fixes and robustness improvements in the IPC file reader:
* Fix bug reading variadic buffers with pre-buffering enabled
* Fix bug reading dictionaries with pre-buffering enabled
* Validate IPC buffer offsets and lengths

Testing improvements:
* Exercise pre-buffering in IPC tests
* Actually exercise variadic buffers in IPC tests, by ensuring non-inline binary views are generated
* Run fuzz targets on golden IPC integration files in ASAN/UBSAN CI job
* Exercise pre-buffering in the IPC file fuzz target

Miscellaneous:
* Add convenience functions for integer overflow checking

### Are these changes tested?

Yes, by existing and improved tests.

### Are there any user-facing changes?

Bug fixes.

**This PR contains a "Critical Fix".** Fixes a potential crash reading variadic buffers with pre-buffering enabled.
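To make the "validate IPC buffer offsets and lengths" and "integer overflow checking" items above more concrete, here is a minimal Python sketch of the general idea. It is not the Arrow C++ code: the names `add_with_overflow` and `validate_buffer_range` are hypothetical, and the signed 64-bit limits are an assumption chosen to mimic typical C++ `int64_t` offsets.

```python
# Hypothetical sketch; names and limits are illustrative assumptions,
# not the helpers added to Arrow by the PR.
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def add_with_overflow(a, b):
    """Return (a + b, overflowed) as if a and b were signed 64-bit integers."""
    total = a + b  # Python integers never overflow, so check the range manually
    overflowed = not (INT64_MIN <= total <= INT64_MAX)
    return total, overflowed


def validate_buffer_range(offset, length, body_size):
    """Raise ValueError unless [offset, offset + length) lies within the body."""
    if offset < 0 or length < 0:
        raise ValueError("negative buffer offset or length")
    end, overflowed = add_with_overflow(offset, length)
    if overflowed:
        raise ValueError("buffer offset + length overflows int64")
    if end > body_size:
        raise ValueError("buffer extends past the end of the message body")
```

The point of checking for overflow separately is that, in a fixed-width language, `offset + length` can wrap around to a small value and slip past the bounds check that follows it.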
* GitHub Issue: #48924

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-48966: [C++] Fix cookie duplication in the Flight SQL ODBC driver and the Flight Client (#48967)

### Rationale for this change

The bug breaks a Flight SQL server that refreshes the auth token when cookie authentication is enabled.

### What changes are included in this PR?

1. In the ODBC layer, removed the code that adds a 2nd ClientCookieMiddlewareFactory in the client options (the 1st one is registered in `BuildFlightClientOptions`). This fixes the issue of the duplicate header cookie fields.
2. In the flight client layer, use the case-insensitive equality comparator instead of the case-insensitive less-than comparator for the cookies cache, which is an unordered map. This fixes the issue of duplicate cookie keys.

### Are these changes tested?

Manually on Windows, and CI

### Are there any user-facing changes?

No

* GitHub Issue: #48966

Authored-by: jianfengmao <jianfengmao@deephaven.io>
Signed-off-by: David Li <li.davidm96@gmail.com>

* GH-48691: [C++][Parquet] Write serializer may crash if the value buffer is empty (#48692)

### Rationale for this change

WriteArrowSerialize could unconditionally read values from the Arrow array even for null rows. Since the caller could provide a zero-sized dummy buffer for all-null arrays, this caused an ASAN heap-buffer-overflow.

### What changes are included in this PR?

Check early that the array is not all-null before serializing it.

### Are these changes tested?

Added tests.

### Are there any user-facing changes?

No

* GitHub Issue: #48691

Authored-by: rexan <rexan@apache.org>
Signed-off-by: Gang Wu <ustcwg@gmail.com>

* GH-48947 [CI][Python] Install pymanager.msi instead of pymanager.msix to fix docker rebuild on Windows wheels (#48948)

### Rationale for this change

As soon as we have to rebuild our Windows docker images they will fail installing python-manager-25.0.msix

### What changes are included in this PR?

- Use `pymanager.msi` to install python version instead of `pymanager.msix` which has problems on Docker.
- Update `pymanager install` command to use newer API (old command fails with missing flags)
- Update default python command to use the free-threaded required suffix if free-threaded wheels

### Are these changes tested?

Yes via archery

### Are there any user-facing changes?

No

* GitHub Issue: #48947

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-48990: [Ruby] Add support for writing date arrays (#48991)

### Rationale for this change

There are date32 and date64 variants for date arrays.

### What changes are included in this PR?

* Add `ArrowFormat::DateType#to_flatbuffers`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

* GitHub Issue: #48990

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48992: [Ruby] Add support for writing large UTF-8 array (#48993)

### Rationale for this change

It's a large variant of UTF-8 array.

### What changes are included in this PR?

* Add `ArrowFormat::LargeUTF8Type#to_flatbuffers`
* Add support for large UTF-8 array of `#values` and `#raw_records`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #48992

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48949: [C++][Parquet] Add Result versions for parquet::arrow::FileReader::ReadRowGroup(s) (#48982)

### Rationale for this change

`FileReader::ReadRowGroup(s)` previously returned `Status` and required callers to pass an `out` parameter.

### What changes are included in this PR?

Introduce `Result<std::shared_ptr<Table>>` returning APIs to allow clearer error propagation:
- Add new Result-returning `ReadRowGroup()` / `ReadRowGroups()` methods
- Deprecate the old Status/out-parameter overloads
- Update C++ callers and R/Python/GLib bindings to use the new API

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes. Status versions of FileReader::ReadRowGroup(s) have been deprecated.

```cpp
virtual ::arrow::Status ReadRowGroup(int i, const std::vector<int>& column_indices,
                                     std::shared_ptr<::arrow::Table>* out);
virtual ::arrow::Status ReadRowGroup(int i, std::shared_ptr<::arrow::Table>* out);
virtual ::arrow::Status ReadRowGroups(const std::vector<int>& row_groups,
                                      const std::vector<int>& column_indices,
                                      std::shared_ptr<::arrow::Table>* out);
virtual ::arrow::Status ReadRowGroups(const std::vector<int>& row_groups,
                                      std::shared_ptr<::arrow::Table>* out);
```

* GitHub Issue: #48949

Lead-authored-by: fenfeng9 <fenfeng9@qq.com>
Co-authored-by: fenfeng9 <36840213+fenfeng9@users.noreply.github.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48985: [GLib][Ruby] Fix GC problems in node options and expressions (#48989)

### Rationale for this change

Some node options and expressions miss arguments reference. If they miss, arguments may be freed by GC.

### What changes are included in this PR?

* Refer arguments of `garrow_filter_node_options_new()`
* Refer arguments of `garrow_project_node_options_new()`
* Refer arguments of `garrow_aggregate_node_options_new()`
* Refer arguments of `garrow_literal_expression_new()`
* Refer arguments of `garrow_call_expression_new()`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

* GitHub Issue: #48985

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-47692: [CI][Python] Do not fallback to return 404 if wheel is found on emscripten jobs (#49007)

### Rationale for this change

When looking for the wheel, the script was falling back to returning a 404 even when the wheel was found:

```
+ python scripts/run_emscripten_tests.py dist/pyarrow-24.0.0.dev31-cp312-cp312-pyodide_2024_0_wasm32.whl --dist-dir=/pyodide --runtime=chrome
127.0.0.1 - - [27/Jan/2026 01:14:50] code 404, message File not found
```

This caused the job to time out and fail.

### What changes are included in this PR?

Correct logic and only return 404 if the file requested wasn't found.

### Are these changes tested?

Yes via archery

### Are there any user-facing changes?

No

* GitHub Issue: #47692

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-48912: [R] Configure C++20 in conda R on continuous benchmarking (#48974)

### Rationale for this change

Benchmark failing since C++20 upgrade due to lack of C++20 configuration

### What changes are included in this PR?

Changes entirely from :robot: (Claude) with discussion from me regarding optimal approach. Description as follows:

> conda-forge's R package doesn't have CXX20 configured in Makeconf, even though the compiler (gcc 14.3.0) supports C++20. This causes Arrow R package installation to fail with "a C++20 compiler is required" because `R CMD config CXX20` returns empty.
>
> This PR adds CXX20 configuration to R's Makeconf before building the Arrow R package in the benchmark hooks, if not already present.

### Are these changes tested?

I got :robot: to try it locally in a container but I'm not convinced we'll know for sure til we try it out properly.

> Tested in Docker container with Amazon Linux 2023 + conda-forge R - confirmed `R CMD config CXX20` returns empty before patch and `g++` after patch.
>
> The only thing we didn't test end-to-end was actually building Arrow R, but that would have taken much longer and the configure check (R CMD config CXX20 returning non-empty) is exactly what Arrow's configure script tests before proceeding.

### Are there any user-facing changes?

Nope

* GitHub Issue: #48912

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>

* GH-36889: [C++][Python] Fix duplicate CSV header when first batch is empty (#48718)

### Rationale for this change

Fixes https://github.com/apache/arrow/issues/36889

When writing CSV from a table where the first batch is empty, the header gets written twice:

```python
import io
import pyarrow as pa
from pyarrow.csv import write_csv
buf = io.BytesIO()
table = pa.table({"col1": ["a", "b", "c"]})
combined = pa.concat_tables([table.schema.empty_table(), table])
write_csv(combined, buf)
# Result: "col1"\n"col1"\n"a"\n"b"\n"c"\n  <-- header appears twice
```

### What changes are included in this PR?

The bug happens because:
1. Header is written to `data_buffer_` and flushed during `CSVWriterImpl` initialization
2. The buffer is not cleared after flush
3. When the next batch is empty, `TranslateMinimalBatch` returns early without modifying `data_buffer_`
4. The write loop then writes `data_buffer_` which still contains stale content

The fix introduces a `WriteAndClearBuffer()` helper that writes the buffer to sink and clears it.
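The four-step failure and the fix can be mimicked with a toy Python model. This is only a sketch of the mechanism described above, not the Arrow C++ implementation: `ToyCsvWriter`, `render`, and the `clear_after_flush` switch are hypothetical names, and values are left unquoted for brevity.

```python
import io


class ToyCsvWriter:
    """Toy model of a writer that reuses one buffer across flushes."""

    def __init__(self, sink, header, clear_after_flush):
        self.sink = sink
        self.clear_after_flush = clear_after_flush
        # Step 1: the header goes into the shared buffer and is flushed at init.
        self.data_buffer = header + "\n"
        self._flush()

    def _flush(self):
        self.sink.write(self.data_buffer)
        if self.clear_after_flush:
            # The fix: always leave the buffer empty after writing it out.
            self.data_buffer = ""
        # Step 2 (bug): without the clear, the old content stays in the buffer.

    def write_batch(self, rows):
        if rows:
            self.data_buffer = "".join(f"{row}\n" for row in rows)
        # Step 3: an empty batch returns early and leaves the buffer untouched.
        # Step 4: the flush then writes whatever the buffer still holds.
        self._flush()


def render(clear_after_flush):
    sink = io.StringIO()
    writer = ToyCsvWriter(sink, "col1", clear_after_flush)
    writer.write_batch([])               # empty first batch
    writer.write_batch(["a", "b", "c"])  # real data
    return sink.getvalue()


print(repr(render(clear_after_flush=False)))  # 'col1\ncol1\na\nb\nc\n'
print(repr(render(clear_after_flush=True)))   # 'col1\na\nb\nc\n'
```

In the toy model, clearing inside the single flush routine plays the role of `WriteAndClearBuffer()`: no caller can forget to reset the buffer.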
This helper is used in all write paths:
- `WriteHeader()`
- `WriteRecordBatch()`
- `WriteTable()`

This ensures the buffer is always clean after any flush, making it impossible for stale content to be written again.

### Are these changes tested?

Yes. Added C++ tests in `writer_test.cc` and Python tests in `test_csv.py`:
- Empty batch at start of table
- Empty batch in middle of table

### Are there any user-facing changes?

No API changes. This is a bug fix that prevents duplicate headers when writing CSV from tables with empty batches.

* GitHub Issue: #36889

Lead-authored-by: Ruiyang Wang <ruiyang@anthropic.com>
Co-authored-by: Ruiyang Wang <56065503+rynewang@users.noreply.github.com>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Gang Wu <ustcwg@gmail.com>

* GH-48932: [C++][Packaging][FlightRPC] Fix `rsync` build error ODBC Nightly Package (#48933)

### Rationale for this change

#48932

### What changes are included in this PR?

- Fix `rsync` build error ODBC Nightly Package

### Are these changes tested?

- tested in CI

### Are there any user-facing changes?

- After fix, users should be able to get Nightly ODBC package release

* GitHub Issue: #48932

Authored-by: Alina (Xi) Li <alina.li@improving.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48951: [Docs] Add documentation relating to AI tooling (#48952)

### Rationale for this change

Add guidance re AI tooling

### What changes are included in this PR?

Updates to main docs and links to it from new contributor's guide

### Are these changes tested?

No, but I'll build the docs

### Are there any user-facing changes?

Just docs

:robot: Changes generated using Claude Code - I took the discussion from the mailing list, asked it to add the original text and then apply suggested changes one at a time, made a few of my own tweaks, and then instructed it to edit things down a bit for clarity and conciseness.
* GitHub Issue: #48951

Lead-authored-by: Nic Crane <thisisnic@gmail.com>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Nic Crane <thisisnic@gmail.com>

* GH-49029: [Doc] Run sphinx-build in parallel (#49026)

### Rationale for this change

`sphinx-build` allows for parallel operation, but it builds serially by default and that can be very slow on our docs given the number of documents (many of them auto-generated from API docs).

### Are these changes tested?

By existing CI jobs.

### Are there any user-facing changes?

No.

* GitHub Issue: #49029

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-33450: [C++] Remove GlobalForkSafeMutex (#49033)

### Rationale for this change

This functionality is unused now that we have a proper atfork facility.

### Are these changes tested?

By existing CI tests.

### Are there any user-facing changes?

Removing an API that was always meant for internal use (though we didn't flag it explicitly as internal).

* GitHub Issue: #33450

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-35437: [C++] Remove obsolete TODO about DictionaryArray const& return types (#48956)

### Rationale for this change

The TODO comment in `vector_array_sort.cc` asking whether `DictionaryArray::dictionary()` and `DictionaryArray::indices()` should return `const&` is obsolete. It was added in commit 6ceb12f700a when dictionary array sorting was implemented. At that time, these methods returned `std::shared_ptr<Array>` by value, causing unnecessary copies.

The issue was fixed in commit 95a8bfb319b, which changed both methods to return `const std::shared_ptr<Array>&`, removing the copies. However, the TODO comment was never removed.

### What changes are included in this PR?

Removed the outdated TODO comment that referenced GH-35437.

### Are these changes tested?

I did not test.
### Are there any user-facing changes?

No.

* GitHub Issue: #35437

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-48586: [Python][CI] Upload artifact to python-sdist job (#49008)

### Rationale for this change

When running the python-sdist job we are currently not uploading the build artifact to the job.

### What changes are included in this PR?

Upload artifact as part of building the job so it's easier to test and validate contents if necessary.

### Are these changes tested?

Yes via archery.

### Are there any user-facing changes?

No

* GitHub Issue: #48586

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* MINOR: [R] Add 22.0.0.1 to compatibility matrix (#49039)

### Rationale for this change

CI needs updating to test old R package versions

### What changes are included in this PR?

Add 22.0.0.1

### Are these changes tested?

Nah, it's CI stuff

### Are there any user-facing changes?

No

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-48961: [Docs][Python] Doctest fails on pandas 3.0 (#48969)

### Rationale for this change

See issue #48961

Pandas 3.0.0 string storage type changes https://github.com/pandas-dev/pandas/pull/62118/changes and https://pandas.pydata.org/docs/whatsnew/v3.0.0.html#dedicated-string-data-type-by-default

### What changes are included in this PR?

Updating several doctest examples from `string` to `large_string`.

### Are these changes tested?

Yes, locally.

### Are there any user-facing changes?

No.

Closes #48961

* GitHub Issue: #48961

Authored-by: Tadeja Kadunc <tadeja.kadunc@gmail.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>

* GH-49037: [Benchmarking] Install R from non-conda source for benchmarking (#49038)

### Rationale for this change

Slow benchmarks due to conda duckdb building from source

### What changes are included in this PR?

Try ditching conda and installing R via rig and using PPM binaries

### Are these changes tested?

I'll try running

### Are there any user-facing changes?

Nope

* GitHub Issue: #49037

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>

* GH-49042: [C++] Remove mimalloc patch (#49041)

### Rationale for this change

This patch was integrated upstream in https://github.com/microsoft/mimalloc/pull/1139

### Are these changes tested?

By existing CI.

### Are there any user-facing changes?

No.

* GitHub Issue: #49042

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49024: [CI] Update Debian version in `.env` (#49032)

### Rationale for this change

Default Debian version in `.env` now maps to oldstable, we should use stable instead. Also prune entries that are not used anymore.

### Are these changes tested?

By existing CI jobs.

### Are there any user-facing changes?

No.

* GitHub Issue: #49024

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49027: [Ruby] Add support for writing time arrays (#49028)

### Rationale for this change

There are 32/64 bit and second/millisecond/microsecond/nanosecond variants for time arrays.

### What changes are included in this PR?

* Add `ArrowFormat::TimeType#to_flatbuffers`
* Add bit width information to `ArrowFormat::TimeType`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

* GitHub Issue: #49027

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49030: [Ruby] Add support for writing fixed size binary array (#49031)

### Rationale for this change

It's a fixed size variant of binary array.

### What changes are included in this PR?

* Add `ArrowFormat::FixedSizeBinaryType#to_flatbuffers`
* Add `ArrowFormat::FixedSizeBinaryArray#each_buffer`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49030

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48866: [C++][Gandiva] Truncate subseconds beyond milliseconds in `castTIMESTAMP_utf8` and `castTIME_utf8` (#48867)

### Rationale for this change

Fixes #48866. The Gandiva precompiled time functions `castTIMESTAMP_utf8` and `castTIME_utf8` currently reject timestamp and time string literals with more than 3 subsecond digits (beyond millisecond precision), throwing an "Invalid millis" error. This behavior is inconsistent with other implementations.

### What changes are included in this PR?

- Fixed `castTIMESTAMP_utf8` and `castTIME_utf8` functions to truncate subseconds beyond 3 digits instead of throwing an error
- Updated tests. Replaced error-expecting tests with truncation verification tests and added edge cases

### Are these changes tested?

Yes

### Are there any user-facing changes?

No

* GitHub Issue: #48866

Authored-by: Arkadii Kravchuk <arkadii.kravchuk@dremio.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48673: [C++] Fix ToStringWithoutContextLines to check for :\d+ pattern before removing lines (#48674)

### Rationale for this change

This PR proposes to fix the todo https://github.com/apache/arrow/blob/7ebc88c8fae62ed97bc30865c845c8061132af7e/cpp/src/arrow/status.cc#L131-L134 which would allow better parsing of line numbers.

I could not find a relevant example to demonstrate within this project, but assume that we have a test such as (generated by ChatGPT):

```cpp
TEST(BlockParser, ErrorMessageWithColonsPreserved) {
  Status st(StatusCode::Invalid,
            "CSV parse error: Row #2: Expected 2 columns, got 3: 12:34:56,key:value,data\n"
            "Error details: Time format: 12:34:56, Key: value\n"
            "parser_test.cc:940 Parse(parser, csv, &out_size)");

  std::string expected_msg =
      "Invalid: CSV parse error: Row #2: Expected 2 columns, got 3: 12:34:56,key:value,data\n"
      "Error details: Time format: 12:34:56, Key: value";

  ASSERT_RAISES_WITH_MESSAGE(Invalid, expected_msg, st);
}

// Test with URL-like data (another common case with colons)
TEST(BlockParser, ErrorMessageWithURLPreserved) {
  Status st(StatusCode::Invalid,
            "CSV parse error: Row #2: Expected 1 columns, got 2: http://arrow.apache.org:8080/api,data\n"
            "URL: http://arrow.apache.org:8080/api\n"
            "parser_test.cc:974 Parse(parser, csv, &out_size)");

  std::string expected_msg =
      "Invalid: CSV parse error: Row #2: Expected 1 columns, got 2: http://arrow.apache.org:8080/api,data\n"
      "URL: http://arrow.apache.org:8080/api";

  ASSERT_RAISES_WITH_MESSAGE(Invalid, expected_msg, st);
}
```

then it fails.

### What changes are included in this PR?

Fixed `Status::ToStringWithoutContextLines()` to only remove context lines matching the `filename:line` pattern (`:\d+`), preventing legitimate error messages containing colons from being incorrectly stripped.

### Are these changes tested?

Manually tested, and unit tests were added, with `cmake .. --preset ninja-debug -DARROW_EXTRA_ERROR_CONTEXT=ON`.

### Are there any user-facing changes?

No, test-only.

* GitHub Issue: #48673

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49044: [CI][Python] Fix test_download_tzdata_on_windows by adding required user-agent on urllib request (#49052)

### Rationale for this change

See: #49044

### What changes are included in this PR?

urllib now sends requests with `"user-agent": "pyarrow"`

### Are these changes tested?

It's a CI fix.

### Are there any user-facing changes?

No, just a CI test fix.

* GitHub Issue: #49044

Authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-48983: [Packaging][Python] Build wheel from sdist using build and add check to validate LICENSE.txt and NOTICE.txt are part of the wheel contents (#48988)

### Rationale for this change

Currently the files are missing from the published wheels.

### What changes are included in this PR?

- Ensure the license and notice files are part of the wheels
- Use build frontend to build wheels
- Build wheel from sdist

### Are these changes tested?

Yes, via archery. I've validated all wheels will fail with the new check if LICENSE.txt or NOTICE.txt are missing:

```
AssertionError: LICENSE.txt is missing from the wheel.
```

### Are there any user-facing changes?

No

* GitHub Issue: #48983

Lead-authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-49059: [C++] Fix issues found by OSS-Fuzz in IPC reader (#49060)

### Rationale for this change

Fix two issues found by OSS-Fuzz in the IPC reader:
* a controlled abort on invalid IPC metadata: https://oss-fuzz.com/testcase-detail/5301064831401984
* a nullptr dereference on invalid IPC metadata: https://oss-fuzz.com/testcase-detail/5091511766417408

Neither of these two issues is a security issue.

### Are these changes tested?

Yes, by new unit tests and new fuzz regression files.

### Are there any user-facing changes?

No.

**This PR contains a "Critical Fix".** (If the changes fix either (a) a security vulnerability, (b) a bug that caused incorrect or invalid data to be produced, or (c) a bug that causes a crash (even when the API contract is upheld), please provide explanation. If not, you can remove this.)
* GitHub Issue: #49059

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-49055: [Ruby] Add support for writing decimal128/256 arrays (#49056)

### Rationale for this change

Decimal128/256 arrays are only supported.

### What changes are included in this PR?

Add `ArrowFormat::DecimalType#to_flatbuffers`.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

* GitHub Issue: #49055

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49053: [Ruby] Add support for writing timestamp array (#49054)

### Rationale for this change

It has `unit` and `time_zone` parameters.

### What changes are included in this PR?

* Add `ArrowFormat::TimestampType#to_flatbuffers`
* Set time zone when GLib timestamp type is converted from C++ timestamp type
* Use `time_zone` not `timezone`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

* GitHub Issue: #49053

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-28859: [Doc][Python] Use only code-block directive and set up doctest for the python user guide (#48619)

### Rationale for this change

In many places in the Python User Guide the code examples are written with the IPython directive (elsewhere code-block is used). IPython directives are converted to IPython format (`In` and `Out`) during the doc build. This can lead to slower builds.

### What changes are included in this PR?

IPython directives are converted to runnable code-block (with `>>>` and `...`) and pytest doctest support for `.rst` files is added to the `conda-python-docs` CI job. This means the code in the Python User Guide is tested separately from the building of the documentation.

### Are these changes tested?

Yes, with the CI.

### Are there any user-facing changes?

Changes to the Python User Guide examples will have to be tested with `pytest --doctest-glob='*.rst' docs/source/python/file.rst`

* GitHub Issue: #28859

Lead-authored-by: AlenkaF <frim.alenka@gmail.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Co-authored-by: tadeja <tadeja@users.noreply.github.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>

* GH-49065: [C++] Remove unnecessary copies of shared_ptr in Type::BOOL and Type::NA at GrouperImpl (#49066)

### Rationale for this change

The grouper code was creating a `shared_ptr<DataType>` for every key type, even when it wasn't needed. This resulted in unnecessary reference counting operations. For example, `BooleanKeyEncoder` and `NullKeyEncoder` don't require a `shared_ptr` in their constructors, yet we were creating one for every key of those types.

### What changes are included in this PR?

Changed `GrouperImpl::Make()` to use `TypeHolder` references directly and only call `GetSharedPtr()` when needed by encoder constructors. This eliminates `shared_ptr` creation for `Type::BOOL` and `Type::NA` cases. Other encoder types (dictionary, fixed-width, binary) still require `shared_ptr` since their constructors take `shared_ptr<DataType>` parameters for ownership.

### Are these changes tested?

Yes, existing tests.

### Are there any user-facing changes?

No.

* GitHub Issue: #49065

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48159 [C++][Gandiva] Projector make is significantly slower after move to OrcJIT (#49063)

### Rationale for this change

Reduces LLVM TargetMachine object creation from 3 to 1. This object is expensive to create and the extra copies weren't needed.

### What changes are included in this PR?

Refactor the Engine class to only create one target machine and pass that to the necessary functions.

Before the change (3 TargetMachines created):

First TargetMachine: In Engine::Make(), MakeTargetMachineBuilder() is called, then BuildJIT() is called. Inside LLJITBuilder::create(), when prepareForConstruction() runs, if no DataLayout was set, it calls JTMB->getDefaultDataLayoutForTarget() which creates a temporary TargetMachine just to get the DataLayout.

Second TargetMachine: Inside BuildJIT(), when setCompileFunctionCreator is used with the lambda, that lambda calls JTMB.createTargetMachine() to create a TargetMachine for the TMOwningSimpleCompiler.

Third TargetMachine: Back in Engine::Make(), after BuildJIT() returns, there's an explicit call to jtmb.createTargetMachine() to create target_machine_ for the Engine.

After the change (1 TargetMachine created):

The key changes are:

Create TargetMachine first: The code now creates the TargetMachine explicitly at the start of Engine::Make(). That machine is passed to BuildJIT(). In BuildJIT(), that machine's DataLayout is passed to LLJITBuilder, which prevents prepareForConstruction() from calling getDefaultDataLayoutForTarget() (which would create a temporary TargetMachine).

Use SimpleCompiler instead of TMOwningSimpleCompiler: SimpleCompiler takes a reference to an existing TargetMachine rather than owning one, so no new TargetMachine is created. A shared_ptr is used to ensure that the TargetMachine stays around for the lifetime of the LLJIT instance.

### Are these changes tested?

Yes, unit and integration.

### Are there any user-facing changes?

No.

* GitHub Issue: #48159

Lead-authored-by: logan.riggs@gmail.com <logan.riggs@gmail.com>
Co-authored-by: Logan Riggs <logan.riggs@dremio.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49043: [C++][FS][Azure] Avoid bugs caused by empty first page(s) followed by non-empty subsequent page(s) (#49049)

### Rationale for this change

Prevent bugs similar to https://github.com/apache/arrow/issues/49043

### What changes are included in this PR?

- Implement `SkipStartingEmptyPages` for various types of PagedResponses used in the `AzureFileSystem`.
- Apply `SkipStartingEmptyPages` on the response from every list operation that returns a paged response.

### Are these changes tested?

Ran the tests in the codebase including the ones that need to connect to real blob storage. This makes me fairly confident that I haven't introduced a regression.

The only reproduction I've found involves reading a production Azure blob storage account. With this I've tested that this PR solves https://github.com/apache/arrow/issues/49043, but I haven't been able to reproduce it in any checked-in tests. I tried copying a chunk of data around our prod reproduction into azurite, but still can't reproduce.

### Are there any user-facing changes?

Some low probability bugs will be gone. No interface changes.

* GitHub Issue: #49043

Authored-by: Thomas Newton <thomas.w.newton@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49034 [C++][Gandiva] Fix binary_string to not trigger error for null strings (#49035)

### Rationale for this change

The binary_string function will attempt to allocate 0 bytes of memory, which results in a null ptr being returned and the function interprets that as an error.

### What changes are included in this PR?

Add kCanReturnErrors to the function definition to match other string functions. Move the check for 0 byte length input earlier in the binary_string function to prevent the 0 allocation. Add a unit test.

### Are these changes tested?

Yes, unit and integration testing.

### Are there any user-facing changes?

No.

* GitHub Issue: #49034

Authored-by: Logan Riggs <logan.riggs@dremio.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48980: [C++] Use COMPILE_OPTIONS instead of deprecated COMPILE_FLAGS (#48981)

### Rationale for this change

Arrow requires CMake 3.25 but was still using the deprecated `COMPILE_FLAGS` property. The recommended replacement is `COMPILE_OPTIONS` (introduced in CMake 3.11).

### What changes are included in this PR?

Replaced `COMPILE_FLAGS` with `COMPILE_OPTIONS` across `CMakeLists.txt` files, converted space-separated strings to semicolon-separated lists, and removed obsolete TODO comments.

### Are these changes tested?

Yes, through CI build and existing tests.

### Are there any user-facing changes?

No.

* GitHub Issue: #48980

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49069: [C++] Share Trie instances across CSV value decoders (#49070)

### Rationale for this change

The CSV converter was building identical Trie data structures (for null/true/false values) in every decoder instance, causing duplicate memory allocation and initialization overhead.

### What changes are included in this PR?

- Introduced `TrieCache` struct to hold shared Trie instances (null_trie, true_trie, false_trie)
- Updated `ValueDecoder` and all decoder subclasses to accept and reference a shared `TrieCache` instead of building their own Tries
- Updated `Converter` base class to create one `TrieCache` per converter and pass it to all decoders

### Are these changes tested?

Yes, all existing tests. I ran a simple benchmark showing roughly 2-4% faster converter creation, and obviously less memory usage.

### Are there any user-facing changes?

No.

* GitHub Issue: #49069

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49076: [CI] Update vcpkg baseline to newer version (#49062)

### Rationale for this change

The current version of vcpkg used is from April 2025

### What changes are included in this PR?

Update baseline to newer version.

### Are these changes tested?

Yes on CI. I've validated for example that xsimd 14 will be pulled.

### Are there any user-facing changes?

No

* GitHub Issue: #49076

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49074: [Ruby] Add support for writing interval arrays (#49075)

### Rationale for this change

There are year-month, day-time, and month-day-nano variants.

### What changes are included in this PR?

* Add `ArrowFormat::IntervalType#to_flatbuffers`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

* GitHub Issue: #49074

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49071: [Ruby] Add support for writing list and large list arrays (#49072)

### Rationale for this change

They use different offset sizes.

### What changes are included in this PR?

* Add `ArrowFormat::ListType#to_flatbuffers`
* Add `ArrowFormat::LargeListType#to_flatbuffers`
* Add `ArrowFormat::VariableSizeListArray#child`
* Add `ArrowFormat::VariableSizeListArray#each_buffer`
* `garrow_array_get_null_bitmap()` returns `NULL` when null bitmap doesn't exist
* Add `garrow_list_array_get_value_offsets_buffer()`
* Add `garrow_large_list_array_get_value_offsets_buffer()`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

* GitHub Issue: #49071

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49087 [CI][Packaging][Gandiva] Add support for LLVM 15 or earlier again (#49091)

### Rationale for this change

LLVM 15 or earlier uses `llvm::Optional`, not `std::optional`.

### What changes are included in this PR?

Use `llvm::Optional` with LLVM 15 or earlier.

### Are these changes tested?

Yes, compiling.

### Are there any user-facing changes?

No

* GitHub Issue: #49087

Authored-by: logan.riggs@gmail.com <logan.riggs@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49100: [Docs] Broken link to Swift page in implementations.rst (#49101)

### Rationale for this change

The Swift documentation link in the implementations.rst file was broken and returned a 404 error.

### What changes are included in this PR?

Updated the Swift documentation link in https://github.com/apache/arrow/blob/235841d644d5454f7067c44f580f301446ba1cc0/docs/source/implementations.rst?plain=1#L124 from the [broken GitHub README link](https://github.com/apache/arrow-swift/blob/main/Arrow/README.md) to the [Swift Package documentation](https://swiftpackageindex.com/apache/arrow-swift/main/documentation/arrow)

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.

* GitHub Issue: #49100

Lead-authored-by: ChiLin Chiu <chilin.chiou@gmail.com>
Co-authored-by: Chilin <chilin.cs07@nycu.edu.tw>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49096: [Ruby] Add support for writing struct array (#49097)

### Rationale for this change

It's a nested array.

### What changes are included in this PR?

* Add `ArrowFormat::StructType#to_flatbuffers`
* Add `ArrowFormat::StructArray#each_buffer`
* Add `ArrowFormat::StructArray#children`
* Fix `ArrowFormat::Array#n_nulls`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

* GitHub Issue: #49096

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49093: [Ruby] Add support for writing duration array (#49094)

### Rationale for this change

It has unit parameter.

### What changes are included in this PR?

* Add `ArrowFormat::DurationType#to_flatbuffers`
* Add duration support to `#values` and `raw_records`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49093

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49098: [Packaging][deb] Add missing libarrow-cuda-glib-doc (#49099)

### Rationale for this change

Documents for libarrow-cuda-glib are generated but they aren't packaged.

### What changes are included in this PR?

Package documents for libarrow-cuda-glib.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

* GitHub Issue: #49098

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48764: [C++] Update xsimd (#48765)

### Rationale for this change

Homogenize the versions used.

### What changes are included in this PR?

Move to xsimd 14 to benefit from the latest improvements, which are relevant for improving the integer unpacking routines.

### Are these changes tested?

Yes, with current CI. In fact, because the version was not pinned, part of the CI already runs xsimd 14.

### Are there any user-facing changes?

No.

* GitHub Issue: #48764

Authored-by: AntoinePrv <AntoinePrv@users.noreply.github.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-46008: [Python][Benchmarking] Remove unused asv benchmarking files (#49047)

### Rationale for this change

As discussed on the issue, we don't seem to have run asv benchmarks on Python for the last few years. It is probably broken.

### What changes are included in this PR?

Remove asv benchmarking related files and docs.

### Are these changes tested?

No. Validate via CI and run preview-docs to validate the docs.

### Are there any user-facing changes?

No

* GitHub Issue: #46008

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-49108: [Python] SparseCOOTensor.__repr__ missing f-string prefix (#49109)

### Rationale for this change

`SparseCOOTensor.__repr__` outputs literal `{self.type}` and `{self.shape}` instead of actual values due to missing f-string prefix.

### What changes are included in this PR?

Add f prefix to the string in `SparseCOOTensor.__repr__`.

### Are these changes tested?

Yes, it works after adding the f-string prefix:

```python3
>>> import pyarrow as pa
>>> import numpy as np
>>> dense_tensor = np.array([[0, 1, 0], [2, 0, 3]], dtype=np.float32)
>>> sparse_coo = pa.SparseCOOTensor.from_dense_numpy(dense_tensor)
>>> sparse_coo
<pyarrow.SparseCOOTensor>
type: float
shape: (2, 3)
```

### Are there any user-facing changes?

Yes. This fixes a bug that caused incorrect or invalid data to be produced. Before the fix:

```python3
>>> import pyarrow as pa
>>> import numpy as np
>>> dense_tensor = np.array([[0, 1, 0], [2, 0, 3]], dtype=np.float32)
>>> sparse_coo = pa.SparseCOOTensor.from_dense_numpy(dense_tensor)
>>> sparse_coo
<pyarrow.SparseCOOTensor>
type: {self.type}
shape: {self.shape}
```

* GitHub Issue: #49108

Authored-by: Chilin <chilin.cs07@nycu.edu.tw>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-49083: [CI][Python] Remove dask-contrib/dask-expr from the nightly dask test builds (#49126)

### Rationale for this change

Failing nightly job for dask (test-conda-python-3.11-dask-upstream_devel).

### What changes are included in this PR?

Removal of dask-contrib/dask-expr package as it is included in the dask dataframe module since January 2025.

### Are these changes tested?

Yes, with the extended dask build.

### Are there any user-facing changes?

No.

* GitHub Issue: #49083

Authored-by: AlenkaF <frim.alenka@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-49117: [Ruby] Add support for writing union arrays (#49118)

### Rationale for this change

There are dense and sparse variants.

### What changes are included in this PR?

* Add `garrow_union_array_get_n_fields()`
* Add `ArrowFormat::UnionArray#children`
* Add `ArrowFormat::DenseUnionArray#each_buffer`
* Add `ArrowFormat::SparseUnionArray#each_buffer`
* Add `ArrowFormat::UnionType#to_flatbuffers`
* Add `Arrow::UnionArray#fields`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49117

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49119: [Ruby] Add support for writing map array (#49120)

### Rationale for this change

It's a list based array.

### What changes are included in this PR?

* Add `ArrowFormat::MapType#to_flatbuffers`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

* GitHub Issue: #49119

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48922: [C++] Support Status-returning callables in Result::Map (#49127)

### Rationale for this change

Currently, Result::Map fails to compile when the mapping function returns a Status because it tries to instantiate `Result<Status>`, which is prohibited. This change allows Map to return Status directly in such cases.

### What changes are included in this PR?

- Added EnsureResult specialization to allow Map to return Status directly.
- Added unit tests to verify success/error propagation and return type resolution.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No

* GitHub Issue: #48922

Authored-by: Abhishek Bansal <abhibansal593@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-49003: [C++] Don't consider `out_of_range` an error in float parsing (#49095)

### Rationale for this change

This PR restores the pre-version-23 behavior for floating-point parsing on overflow and subnormal values. `fast_float` didn't assign an error code on overflow in version `3.10.1` and assigned `±Inf` on overflow and `0.0` on subnormal. With the update to version `8.1`, it started to assign `std::errc::result_out_of_range` in such cases.

### What changes are included in this PR?

Ignores `std::errc::result_out_of_range` and produces `±Inf` / `0.0` as appropriate instead of failing the conversion.

### Are these changes tested?

Yes.
Created tests for overflow with positive and negative signed mantissa, and also created tests for subnormal, all of them for binary{16,32,64}.

### Are there any user-facing changes?

It's a user-facing change. The CSV reader on version `libarrow==23` was reading these values as strings, while before it was parsing them as `0` or `±inf`.

With this patch, the CSV reader in PyArrow outputs:

```python
>>> import pyarrow
>>> import pyarrow.csv
>>> import io
>>> table = pyarrow.csv.read_csv(io.BytesIO(f"data\n10E-617\n10E617\n-10E617".encode()))
>>> print(table)
pyarrow.Table
data: double
----
data: [[0,inf,-inf]]
```

Closes #49003

* GitHub Issue: #49003

Authored-by: Alvaro-Kothe <kothe65@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-48941: [C++] Generate proper UTF-8 strings in JSON test utilities (#48943)

### Rationale for this change

The JSON test utility `GenerateAscii` was only generating ASCII characters. The tests should also cover proper UTF-8 and Unicode handling.

### What changes are included in this PR?

Replaced ASCII-only generation with proper UTF-8 string generation that produces valid Unicode scalar values across all planes (BMP, SMP, SIP, planes 3-16), correctly encoded per RFC 3629. Added that function as a utility.

### Are these changes tested?

There are existing tests for JSON.

### Are there any user-facing changes?

No, test-only.

* GitHub Issue: #48941

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-49067: [R] Disable GCS on macos (#49068)

### Rationale for this change

Builds that complete on CRAN

### What changes are included in this PR?

Disable GCS by default

### Are these changes tested?

### Are there any user-facing changes?

Hopefully not

**This PR includes breaking changes to public APIs.** (If there are any breaking changes to public APIs, please explain which changes are breaking. If not, you can remove this.)

**This PR contains a "Critical Fix".** (If the changes fix either (a) a security vulnerability, (b) a bug that caused incorrect or invalid data to be produced, or (c) a bug that causes a crash (even when the API contract is upheld), please provide explanation. If not, you can remove this.)

* GitHub Issue: #49067

---------

Co-authored-by: Nic Crane <thisisnic@gmail.com>

* GH-49115: [CI][Packaging][Python] Update vcpkg baseline for our wheels (#49116)

### Rationale for this change

Current wheels are failing to be built due to old version of vcpkg failing with our latest main.

### What changes are included in this PR?

- Update vcpkg version.
- Update patches
- Add `perl-Time-Piece` to some images as required to build newer OpenSSL.

### Are these changes tested?

Yes on CI

### Are there any user-facing changes?

No

* GitHub Issue: #49115

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48954: [C++] Add test for null-type dictionary sorting and clarify XXX comment (#48955)

### Rationale for this change

Null-type dictionaries (e.g., `dictionary(int8(), null())`) are valid Arrow constructs supported from day one, but the sorting code had an uncertain `XXX Should this support Type::NA?` comment. We should explicitly support and test this because other functions already support this:

```python
import pyarrow as pa
import pyarrow.compute as pc

pc.array_sort_indices(pa.array([None, None, None, None], type=pa.int32()))
# [0, 1, 2, 3]
pc.array_sort_indices(pa.DictionaryArray.from_arrays(
    indices=pa.array([None, None, None, None], type=pa.int8()),
    dictionary=pa.array([], type=pa.null())
))
# [0, 1, 2, 3]
```

I believe it does not make sense to specifically disallow this in dictionaries at this point.

### What changes are included in this PR?

Added a unittest for null sorting behaviour.

### Are these changes tested?

Yes, the unittest was added.

### Are there any user-facing changes?

No.
* GitHub Issue: #48954

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-36193: [R] arm64 binaries for R (#48574)

### Rationale for this change

Issues building on ARM

### What changes are included in this PR?

CI job and nixlibs update

### Are these changes tested?

On CI

### Are there any user-facing changes?

No

AI changes :robot:: Claude decided where to make the changes and helped debug failing builds, but I updated most of it (e.g. rstudio -> posit, choice of runners etc)

* GitHub Issue: #36193

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>

* GH-48397: [R] Update docs on how to get our libarrow builds (#48995)

### Rationale for this change

Turning off GCS on CRAN to prevent excessive build times, need to tell people who wanna work with GCS how to do that.

### What changes are included in this PR?

Update docs.

### Are these changes tested?

Will preview docs build.

### Are there any user-facing changes?

Just docs.

* GitHub Issue: #48397

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>

* GH-49104: [C++] Fix Segfault in SparseCSFIndex::Equals with mismatched dimensions (#49105)

### Rationale for This Change

The `SparseCSFIndex::Equals` method can crash when comparing two sparse indices that have a different number of dimensions. The method iterates over the `indices()` and `indptr()` vectors of the current object and accesses the corresponding elements in the `other` object without first verifying that both objects have matching vector sizes. This can lead to out-of-bounds access and a segmentation fault when the dimension counts differ.

### What Changes Are Included in This PR?

This change adds explicit size equality checks for the `indices()` and `indptr()` vectors at the beginning of the `SparseCSFIndex::Equals` method.
If the dimensions do not match, the method now safely returns `false` instead of attempting invalid memory access.

### Are These Changes Tested?

Yes. The fix has been validated through targeted reproduction of the crash scenario using mismatched dimension counts, ensuring the method behaves safely and deterministically.

### Are There Any User-Facing Changes?

No. This change improves internal safety and robustness without altering public APIs or observable user behavior.

* GitHub Issue: #49104

Lead-authored-by: Alirana2829 <alimahmoodrana00@gmail.com>
Co-authored-by: Ali Mahmood Rana <159713825+AliRana30@users.noreply.github.com>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Rok Mihevc <rok@mihevc.org>

* MINOR: [Docs] Add links to AI-generated code guidance (#49131)

### Rationale for this change

Add link to AI-generated code guidance - we should make sure the docs are updated before we merge this though

### What changes are included in this PR?

Add link to AI-generated code guidance

### Are these changes tested?

No

### Are there any user-facing changes?

No

Lead-authored-by: Nic Crane <thisisnic@gmail.com>
Co-authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>

* MINOR: [R] Add new vignette to pkgdown config (#49145)

### Rationale for this change

CI failing on preview-docs; see #49141

### What changes are included in this PR?

Add the vignette created in #49068 to pkgdown config

### Are these changes tested?

I'll trigger CI

### Are there any user-facing changes?

Nah

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>

* GH-49150: [Doc][CI][Python] Doctests failing on rst files due to pandas 3+ (#49088)

Fixes: #49150
See https://github.com/apache/arrow/pull/48619#issuecomment-3823269381

### Rationale for this change

Fix CI failures

### What changes are included in this PR?

Tests are made more general to allow for Pandas 2 and Pandas 3 style string types

### Are these changes tested?

By CI

### Are there any user-facing changes?

No

* GitHub Issue: #49150

Authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Rok Mihevc <rok@mihevc.org>

* GH-41990: [C++] Fix AzureFileSystem compilation on Windows (#48971)

Let me preface this pull request that I have not worked in C++ in quite a while. Apologies if this is missing modern idioms or is an obtuse fix.

### Rationale for this change

I encountered an issue trying to compile the AzureFileSystem backend in C++ on Windows. Searching the issue tracker, it appears this is already a [known](https://github.com/apache/arrow/issues/41990) but unresolved problem. This is an attempt to either address the issue or move the conversation forward for someone more experienced.

### What changes are included in this PR?

AzureFileSystem uses `unique_ptr` while the other cloud file system implementations rely on `shared_ptr`. Since this is a forward-declared Impl in the headers file but the destructor was defined inline (via `= default`), we're getting compilation issues with MSVC due to it requiring the complete type earlier than GCC/Clang.

This change removes the defaulted definition from the header file and moves it into the .cc file where we have a complete type.

Unrelated, I've also wrapped 2 exception variables in `ARROW_UNUSED`. These are warnings treated as errors by MSVC at compile time. This was revealed in CI after resolving the issue above.

### Are these changes tested?

I've enabled building and running the test suite in GHA in 8dd62d62a9af022813e9c8662956740340a9473f. I believe a large portion of those tests may be skipped though since Azurite isn't present from what I can see. I'm not tied to the GHA updates being included in the PR, it's currently here for demonstration purposes. I noticed the other FS implementations are also not built and tested on Windows.
One quirk of this PR is getting WIL in place to compile the Azure C++ SDK was not intuitive for me. I've placed a dummy `wilConfig.cmake` to get the Azure SDK to build, but I'd assume there's a better way to do this. I'm happy to refine the build setup if we choose to keep it.

### Are there any user-facing changes?

Nothing here should affect user-facing code beyond fixing the compilation issues. If there are concerns for things I'm missing, I'm happy to discuss those.

* GitHub Issue: #41990

Lead-authored-by: Nate Prewitt <nateprewitt@microsoft.com>
Co-authored-by: Nate Prewitt <nate.prewitt@gmail.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49138: [Packaging][Python] Remove nightly cython install from manylinux wheel dockerfile (#49139)

### Rationale for this change

We use nightly versions of Cython for free-threaded PyArrow wheels and they are currently failing, see https://github.com/apache/arrow/issues/49138

### What changes are included in this PR?

Nightly Cython install is removed and Cython is installed via [requirements file](https://github.com/apache/arrow/blob/main/python/requirements-wheel-build.txt#L2).

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.

* GitHub Issue: #49138

Authored-by: AlenkaF <frim.alenka@gmail.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>

* GH-33459: [C++][Python] Support step >= 1 in list_slice kernel (#48769)

### Rationale for this change

Closes ARROW-18281, which has been open since 2022. The `list_slice` kernel currently rejects `start == stop`, but should return empty lists instead (following Python slicing semantics).

The implementation already handles this case correctly. When ARROW-18282 added step support, `bit_util::CeilDiv(stop - start, step)` naturally returns 0 for `start == stop`, producing empty lists.
The only issue was the validation check (`start >= stop`) that prevented this from working.

### What changes are included in this PR?

- Changed validation from `start >= stop` to `start > stop`
- Updated error message
- Added test cases

### Are these changes tested?

Yes, tests were added.

### Are there any user-facing changes?

Yes.

```python
import pyarrow.compute as pc
pc.list_slice([[1,2,3]], 0, 0)
```

Before:

```
pyarrow.lib.ArrowInvalid: `start`(0) should be greater than 0 and smaller than `stop`(0)
```

After:

```
<pyarrow.lib.ListArray object at 0x1a01b8b20>
[
  []
]
```

* GitHub Issue: #33459

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>

* GH-41863: [Python][Parquet] Support lz4_raw as a compression name alias (#49135)

Closes https://github.com/apache/arrow/issues/41863

### Rationale for this change

Other tools in the parquet ecosystem distinguish between `LZ4` and `LZ4_RAW`, matching the specification: https://parquet.apache.org/docs/file-format/data-pages/compression/

`LZ4` (framing) is of course deprecated. PyArrow does not support it, and instead simplifies the user-facing API, using `LZ4` as an alias for the `LZ4_RAW` codec.

However, PyArrow does not accept `LZ4_RAW` as a valid alias for the `LZ4_RAW` codec:

```
ArrowException: Unsupported compression: lz4_raw
```

This is a friction issue, and confusing for some users who are aware of the differences.

### What changes are included in this PR?

- Adding `LZ4_RAW` to the acceptable codec names list.
- Modifying the `LZ4->LZ4_RAW` mapping to also accept `LZ4_RAW->LZ4_RAW`.
- Adding a test

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes, an additive change to the accepted codec names.
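The alias handling described for `LZ4` and `LZ4_RAW` can be illustrated with a small, self-contained sketch. This is not PyArrow's actual implementation: `normalize_parquet_codec` and `_CODEC_ALIASES` are hypothetical names, and the codec table is an assumption trimmed to a few entries. It only shows the idea that both spellings resolve to the same codec while unknown names are still rejected.

```python
# Hypothetical sketch of codec-name normalization; the names and the
# codec table are illustrative assumptions, not PyArrow internals.
_CODEC_ALIASES = {
    "SNAPPY": "SNAPPY",
    "GZIP": "GZIP",
    "ZSTD": "ZSTD",
    "LZ4": "LZ4_RAW",      # long-standing alias
    "LZ4_RAW": "LZ4_RAW",  # explicit spelling, accepted as well
}


def normalize_parquet_codec(name: str) -> str:
    """Map a user-supplied codec name to its canonical codec name."""
    try:
        # Lookup is case-insensitive: "lz4_raw" and "LZ4_RAW" are equivalent.
        return _CODEC_ALIASES[name.upper()]
    except KeyError:
        raise ValueError(f"Unsupported compression: {name}") from None
```

Under these assumptions, `normalize_parquet_codec("lz4")` and `normalize_parquet_codec("lz4_raw")` return the same canonical name, which mirrors the additive nature of the change: existing callers keep working and the explicit spelling stops raising.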
* GitHub Issue: #41863

Authored-by: Nick Woolmer <29717167+nwoolmer@users.noreply.github.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>

* GH-48868: [Doc] Document security model for the Arrow formats (#48870)

### Rationale for this change

Accessing Arrow data or any of the formats can have non-trivial security implications, this is an attempt at documenting those.

### What changes are included in this PR?

Add a Security Considerations page in the Format section.

**Doc preview:** https://s3.amazonaws.com/arrow-data/pr_docs/48870/format/Security.html

### Are these changes tested?

N/A

### Are there any user-facing changes?

No.

* GitHub Issue: #48868

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-49004: [C++][FlightRPC] Run ODBC tests in workflow using `cpp_test.sh` (#49005)

### Rationale for this change

#49004

### What changes are included in this PR?

- Run tests using `cpp_test.sh` in the ODBC job of C++ Extra CI.

Note: `find_package(Arrow)` check in `cpp_test.sh` is disabled due to blocker GH-49050

### Are these changes tested?

Yes, in CI

### Are there any user-facing changes?

N/A

* GitHub Issue: #49004

Lead-authored-by: Alina (Xi) Li <alina.li@improving.com>
Co-authored-by: Alina (Xi) Li <96995091+alinaliBQ@users.noreply.github.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49092: [C++][FlightRPC][CI] Nightly Packaging: Add `dev-yyyy-mm-dd` to ODBC MSI name (#49151)

### Rationale for this change

#49092

### What changes are included in this PR?

- Add `dev-yyyy-mm-dd` to ODBC MSI name. This is a similar approach to R nightly.

Before: `Apache Arrow Flight SQL ODBC-1.0.0-win64.msi`.
After: `Apache Arrow Flight SQL ODBC-1.0.0-dev-2026-02-04-win64.msi`.

### Are these changes tested?

Tested in CI. Successfully renamed file: https://github.com/apache/arrow/actions/runs/21686252848/job/62534629714?pr=49151#step:3:26

### Are there any user-facing changes?

Yes, the nightly ODBC file names will be changed as described above.

* GitHub Issue: #49092

Authored-by: Alina (Xi) Li <alina.li@improving.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49156: [Python] Require GIL for string comparison (#49161)

### Rationale for this change

With Cython 3.3.0.a0 this failed. After some discussion, it seems that this should always have required the GIL.

### What changes are included in this PR?

Moving statement out of the `with nogil` context manager.

### Are these changes tested?

Existing CI builds pyarrow.

### Are there any user-facing changes?

No

* GitHub Issue: #49156

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-48575: [C++][FlightRPC] Standalone ODBC macOS CI (#48577)

### Rationale for this change

#48575

### What changes are included in this PR?

- Add new ODBC workflow for macOS Intel 15 and 14 arm64.
- Added ODBC build fixes to enable build on macOS CI.

### Are these changes tested?

Tested in CI and local macOS Intel and M1 environments.

### Are there any user-facing changes?

N/A

* GitHub Issue: #48575

Lead-authored-by: Alina (Xi) Li <alina.li@improving.com>
Co-authored-by: justing-bq <62349012+justing-bq@users.noreply.github.com>
Co-authored-by: Victor Tsang <victor.tsang@improving.com>
Co-authored-by: Alina (Xi) Li <alinal@bitquilltech.com>
Co-authored-by: vic-tsang <victor.tsang@improving.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49164: [C++] Avoid invalid if() args in cmake when arrow is a subproject (#49165)

### Rationale for this change

Ref #49164: In subproject builds, `DefineOptions.cmake` sets `ARROW_DEFINE_OPTIONS_DEFAULT` to OFF, so `ARROW_SIMD_LEVEL` is never defined. The `if()` at `cpp/src/arrow/io/CMakeLists.txt:48` uses `${ARROW_SIMD_LEVEL}`, which expands to empty, leading to invalid `if()` arguments.

### What changes are included in this PR?

Use the variable name directly (no `${}`).

### Are these changes tested?

Yes.

### Are there any user-facing changes?

None.

* GitHub Issue: #49164

Authored-by: Rossi Sun <zanmato1984@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48132: [Ruby] Add support for writing dictionary array (#49175)

### Rationale for this change

Delta dictionary message support is out of scope.

### What changes are included in this PR?

* Add `ArrowFormat::DictionaryArray#each_buffer`
* Add `ArrowFormat::DictionaryType#build_fb_type`
* Add support for dictionary message in `ArrowFormat::StreamingWriter`
* Add support for writing dictionary message blocks in footer in `ArrowFormat::FileWriter`.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

* GitHub Issue: #48132

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49081: [C++][Parquet] Correct variant's extension name (#49082)

### Rationale for this change

Correct variant extension according to arrow's specification.

### What changes are included in this PR?

Modified variant's hardcoded extension name.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.

* GitHub Issue: #49081

Authored-by: Zehua Zou <zehuazou2000@gmail.com>
Signed-off-by: Gang Wu <ustcwg@gmail.com>

* GH-49102: [CI] Add type checking infrastructure and CI workflow for type annotations (#48618)

### Rationale for this change

This is the first in series of PRs adding type annotations to pyarrow and resolving #32609.

### What changes are included in this PR?

This PR establishes infrastructure for type checking:

- Adds CI workflow for running mypy, pyright, and ty type checkers on linux, macos and windows
- Configures type checkers to validate stub files (excluding source files for now)
- Adds PEP 561 `py.typed` marker to enable type checking
- Updates wheel build scripts to include stub files in distributions
- Creates initial minimal stub directory structure
- Updates developer documentation with type checking workflow

### Are these changes tested?

No. This is mostly a CI change.

### Are there any user-facing changes?

This does not add any actual annotations (only `py.typed` marker) so user should not be affected.

* GitHub Issue: #32609
* GitHub Issue: #49102

Lead-authored-by: Rok Mihevc <rok@mihevc.org>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Rok Mihevc <rok@mihevc.org>

* GH-49190: [C++][CI] Fix `unknown job 'odbc' error` in C++ Extra Workflow (#49192)

### Rationale for this change

See #49190

### What changes are included in this PR?

Fix `unknown job 'odbc' error` caused by typo

### Are these changes tested?

Tested in CI

### Are there any user-facing changes?

N/A

* GitHub Issue: #49190

Authored-by: Alina (Xi) Li <alinal@bitquilltech.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* MINOR: [CI] Bump docker/login-action from 3.6.0 to 3.7.0 (#49191)

Bumps [docker/login-action](https://github.com/docker/login-action) from 3.6.0 to 3.7.0.
<details> <summary>Release notes</summary> <p><em>Sourced from <a href="https://github.com/docker/login-action/releases">docker/login-action's releases</a>.</em></p> <blockquote> <h2>v3.7.0</h2> <ul> <li>Add <code>scope</code> input to set scopes for the authentication token by <a href="https://github.com/crazy-max"><code>@​crazy-max</code></a> in <a href="https://redirect.github.com/docker/login-action/pull/912">docker/login-action#912</a></li> <li>Add support for AWS European Sovereign Cloud ECR by <a href="https://github.com/dphi"><code>@​dphi</code></a> in <a href="https://redirect.github.com/docker/login-action/pull/914">docker/login-action#914</a></li> <li>Ensure passwords are redacted with <code>registry-auth</code> input by <a href="https://github.com/crazy-max"><code>@​crazy-max</code></a> in <a href="https://redirect…
1 parent 1779d26 commit a515a4d

319 files changed

Lines changed: 12625 additions & 3895 deletions

.env

Lines changed: 4 additions & 7 deletions
@@ -52,7 +52,7 @@ ULIMIT_CORE=-1
 # Default versions for platforms
 ALMALINUX=8
 ALPINE_LINUX=3.22
-DEBIAN=12
+DEBIAN=13
 FEDORA=42
 UBUNTU=22.04
 
@@ -61,11 +61,9 @@ CLANG_TOOLS=18
 CMAKE=3.26.0
 CUDA=11.7.1
 DASK=latest
-DOTNET=8.0
 GCC=
 HDFS=3.2.1
 JDK=11
-KARTOTHEK=latest
 # LLVM 12 and GCC 11 reports -Wmismatched-new-delete.
 LLVM=18
 MAVEN=3.8.7
@@ -79,7 +77,6 @@ PYTHON_IMAGE_TAG=3.10
 PYTHON_ABI_TAG=cp310
 R=4.5
 SPARK=master
-TURBODBC=latest
 
 # These correspond to images on Docker Hub that contain R, e.g. rhub/ubuntu-release:latest
 R_IMAGE=ubuntu-release
@@ -96,14 +93,14 @@ TZ=UTC
 # Used through compose.yaml and serves as the default version for the
 # ci/scripts/install_vcpkg.sh script. Prefer to use short SHAs to keep the
 # docker tags more readable.
-VCPKG="4334d8b4c8916018600212ab4dd4bbdc343065d1" # 2025.09.17 Release
+VCPKG="66c0373dc7fca549e5803087b9487edfe3aca0a1" # 2026.01.16 Release
 
 # This must be updated when we update
 # ci/docker/python-*-windows-*.dockerfile or the vcpkg config.
 # This is a workaround for our CI problem that "archery docker build" doesn't
 # use pulled built images in dev/tasks/python-wheels/github.windows.yml.
-PYTHON_WHEEL_WINDOWS_IMAGE_REVISION=2025-10-13
-PYTHON_WHEEL_WINDOWS_TEST_IMAGE_REVISION=2025-10-13
+PYTHON_WHEEL_WINDOWS_IMAGE_REVISION=2026-02-07
+PYTHON_WHEEL_WINDOWS_TEST_IMAGE_REVISION=2026-02-07
 
 # Use conanio/${CONAN_BASE}:{CONAN_VERSION} for "docker compose run --rm conan".
 # See https://github.com/conan-io/conan-docker-tools#readme and

.github/pull_request_template.md

Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@ If this is your first pull request you can find detailed information on how to c

 * [New Contributor's Guide](https://arrow.apache.org/docs/dev/developers/guide/step_by_step/pr_lifecycle.html#reviews-and-merge-of-the-pull-request)
 * [Contributing Overview](https://arrow.apache.org/docs/dev/developers/overview.html)
+* [AI-generated Code Guidance](https://arrow.apache.org/docs/dev/developers/overview.html#ai-generated-code)

 Please remove this line and the above text before creating your pull request.

.github/workflows/cpp_extra.yml

Lines changed: 188 additions & 14 deletions
@@ -336,9 +336,79 @@ jobs:
           cd cpp/examples/minimal_build
           ../minimal_build.build/arrow-example

-  odbc:
+  odbc-macos:
     needs: check-labels
-    name: ODBC
+    name: ODBC ${{ matrix.architecture }} macOS ${{ matrix.macos-version }}
+    runs-on: macos-${{ matrix.macos-version }}
+    if: >-
+      needs.check-labels.outputs.force == 'true' ||
+      contains(fromJSON(needs.check-labels.outputs.ci-extra-labels || '[]'), 'CI: Extra') ||
+      contains(fromJSON(needs.check-labels.outputs.ci-extra-labels || '[]'), 'CI: Extra: C++')
+    timeout-minutes: 75
+    strategy:
+      fail-fast: false
+      matrix:
+        include:
+          - architecture: AMD64
+            macos-version: "15-intel"
+          - architecture: ARM64
+            macos-version: "14"
+    env:
+      ARROW_BUILD_TESTS: ON
+      ARROW_FLIGHT_SQL_ODBC: ON
+      ARROW_HOME: /tmp/local
+    steps:
+      - name: Checkout Arrow
+        uses: actions/checkout@v6.0.1
+        with:
+          fetch-depth: 0
+          submodules: recursive
+      - name: Install Dependencies
+        run: |
+          brew bundle --file=cpp/Brewfile
+      - name: Setup ccache
+        run: |
+          ci/scripts/ccache_setup.sh
+      - name: ccache info
+        id: ccache-info
+        run: |
+          echo "cache-dir=$(ccache --get-config cache_dir)" >> $GITHUB_OUTPUT
+      - name: Cache ccache
+        uses: actions/cache@v5.0.2
+        with:
+          path: ${{ steps.ccache-info.outputs.cache-dir }}
+          key: cpp-odbc-ccache-macos-${{ matrix.macos-version }}-${{ hashFiles('cpp/**') }}
+          restore-keys: cpp-odbc-ccache-macos-${{ matrix.macos-version }}-
+      - name: Build
+        run: |
+          # Homebrew uses /usr/local as prefix. So packages
+          # installed by Homebrew also use /usr/local/include. We
+          # want to include headers for packages installed by
+          # Homebrew as system headers to ignore warnings in them.
+          # But "-isystem /usr/local/include" isn't used by CMake
+          # because /usr/local/include is marked as the default
+          # include path. So we disable -Werror to avoid build error
+          # by warnings from packages installed by Homebrew.
+          export BUILD_WARNING_LEVEL=PRODUCTION
+          LIBIODBC_DIR="$(brew --cellar libiodbc)/$(brew list --versions libiodbc | awk '{print $2}')"
+          ODBC_INCLUDE_DIR=$LIBIODBC_DIR/include
+          export ARROW_CMAKE_ARGS="-DODBC_INCLUDE_DIR=$ODBC_INCLUDE_DIR"
+          export CXXFLAGS="$CXXFLAGS -I$ODBC_INCLUDE_DIR"
+          ci/scripts/cpp_build.sh $(pwd) $(pwd)/build
+      - name: Register Flight SQL ODBC Driver
+        run: |
+          sudo cpp/src/arrow/flight/sql/odbc/install/mac/install_odbc.sh $(pwd)/build/cpp/debug/libarrow_flight_sql_odbc.dylib
+      - name: Test
+        shell: bash
+        run: |
+          sudo sysctl -w kern.coredump=1
+          sudo sysctl -w kern.corefile=/tmp/core.%N.%P
+          ulimit -c unlimited # must enable within the same shell
+          ci/scripts/cpp_test.sh $(pwd) $(pwd)/build
+
+  odbc-msvc:
+    needs: check-labels
+    name: ODBC Windows
     runs-on: windows-2022
     if: >-
       needs.check-labels.outputs.force == 'true' ||
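The `Build` step of `odbc-macos` derives `ODBC_INCLUDE_DIR` from two Homebrew queries. The sketch below isolates that path construction so it can be checked without Homebrew installed. The function name `libiodbc_include_dir` and the sample cellar path and version are assumptions for illustration; only the `awk '{print $2}'` extraction and the `<cellar>/<version>/include` layout come from the step itself.

```shell
#!/bin/sh
# Hypothetical helper (not part of the workflow): rebuilds the include path
# from values the workflow obtains from Homebrew at run time.
#   $1: output of `brew --cellar libiodbc`
#   $2: output of `brew list --versions libiodbc`, e.g. "libiodbc <version>"
libiodbc_include_dir() {
  cellar="$1"
  # Same extraction as the workflow: the second whitespace-separated field.
  version="$(printf '%s\n' "$2" | awk '{print $2}')"
  printf '%s/%s/include\n' "$cellar" "$version"
}

# Sample inputs; real values depend on the runner image.
libiodbc_include_dir "/opt/homebrew/Cellar/libiodbc" "libiodbc 3.52.16"
```

Because only the second field is used, this approach implicitly assumes a single installed libiodbc version; if Homebrew listed several, the first one listed would be picked.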
@@ -352,6 +422,9 @@ jobs:
       ARROW_BUILD_STATIC: OFF
       ARROW_BUILD_TESTS: ON
       ARROW_BUILD_TYPE: release
+      # Turn Arrow CSV off to disable `find_package(Arrow)` check on MSVC CI.
+      # GH-49050 TODO: enable `find_package(Arrow)` check on MSVC CI.
+      ARROW_CSV: OFF
       ARROW_DEPENDENCY_SOURCE: VCPKG
       ARROW_FLIGHT_SQL_ODBC: ON
       ARROW_FLIGHT_SQL_ODBC_INSTALLER: ON
@@ -434,10 +507,15 @@ jobs:
         shell: cmd
         run: |
           call "cpp\src\arrow\flight\sql\odbc\tests\install_odbc.cmd" ${{ github.workspace }}\build\cpp\%ARROW_BUILD_TYPE%\arrow_flight_sql_odbc.dll
-      # GH-48270 TODO: Resolve segementation fault during Arrow library unload
-      # GH-48269 TODO: Enable Flight & Flight SQL testing in MSVC CI
-      # GH-48547 TODO: enable ODBC tests after GH-48270 and GH-48269 are resolved.
-
+      - name: Test
+        shell: cmd
+        run: |
+          set VCPKG_ROOT_KEEP=%VCPKG_ROOT%
+          call "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Auxiliary\Build\vcvarsall.bat" x64
+          set VCPKG_ROOT=%VCPKG_ROOT_KEEP%
+          # Convert VCPKG Windows path to MSYS path
+          for /f "usebackq delims=" %%I in (`bash -c "cygpath -u \"$VCPKG_ROOT_KEEP\""` ) do set VCPKG_ROOT=%%I
+          bash -c "ci/scripts/cpp_test.sh $(pwd) $(pwd)/build"
       - name: Install WiX Toolset
         shell: pwsh
         run: |
@@ -455,18 +533,79 @@ jobs:
         uses: actions/upload-artifact@v6
         with:
           name: flight-sql-odbc-msi-installer
-          path: build/cpp/Apache Arrow Flight SQL ODBC-*-win64.msi
+          path: build/cpp/Apache-Arrow-Flight-SQL-ODBC-*-win64.msi
           if-no-files-found: error
-      # Upload ODBC installer as nightly release in scheduled runs
+      - name: Install ODBC MSI
+        run: |
+          cd build/cpp
+          $odbc_msi = Get-ChildItem -Filter "Apache-Arrow-Flight-SQL-ODBC-*-win64.msi"
+          if (-not $odbc_msi) {
+            Write-Error "ODBC MSI not found"
+            exit 1
+          }
+
+          foreach ($msi in $odbc_msi) {
+            Write-Host "Installing $($msi.Name) with logs"
+            $log = "odbc-install.log"
+            Start-Process msiexec.exe -Wait -ArgumentList "/i `"$msi`"", "/qn", "/L*V `"$log`""
+            Get-Content $log
+          }
+      - name: Check ODBC DLL installation
+        run: |
+          $dirs = Get-ChildItem "C:\Program Files" -Directory -Filter "Apache-Arrow-Flight-SQL-ODBC*"
+
+          foreach ($dir in $dirs) {
+            $bin = Join-Path $dir.FullName "bin"
+
+            if (Test-Path $bin) {
+              tree $bin /f
+
+              $dll = Join-Path $bin "arrow_flight_sql_odbc.dll"
+              if (Test-Path $dll) {
+                Write-Host "Found ODBC DLL: $dll"
+                exit 0
+              }
+            }
+          }
+
+          Write-Error "ODBC DLL not found"
+          exit 1
+
+  odbc-nightly:
+    needs: odbc-msvc
+    name: ODBC nightly
+    runs-on: ubuntu-latest
+    if: github.event_name == 'schedule' && github.repository == 'apache/arrow'
+    steps:
+      - name: Download the artifacts
+        uses: actions/download-artifact@v7
+        with:
+          name: flight-sql-odbc-msi-installer
       - name: Prepare ODBC installer for sync
-        if: github.event_name == 'schedule'
         run: |
           mkdir odbc-installer
-          Move-Item "build/cpp/Apache Arrow Flight SQL ODBC-*-win64.msi" odbc-installer/
-          tree odbc-installer /f
+          mv *.msi odbc-installer/
+
+          # Add `dev-yyyy-mm-dd` to ODBC MSI before `win64.msi`:
+          # Apache Arrow Flight SQL ODBC-24.0.0-win64.msi ->
+          # Apache Arrow Flight SQL ODBC-24.0.0-dev-2026-02-06-win64.msi
+          cd odbc-installer
+          msi_name=$(ls *.msi)
+          dev_msi_name=$(echo ${msi_name} | sed -e "s/win64\.msi$/dev-$(date +%Y-%m-%d)-win64.msi/")
+          mv "${msi_name}" "${dev_msi_name}"
+          cd ..
+
+          tree odbc-installer
+      - name: Checkout Arrow
+        uses: actions/checkout@v6
+        with:
+          fetch-depth: 1
+          path: arrow
+          repository: apache/arrow
+          ref: main
+          submodules: recursive
       - name: Sync to Remote
-        if: github.event_name == 'schedule'
-        uses: ./.github/actions/sync-nightlies
+        uses: ./arrow/.github/actions/sync-nightlies
         with:
           upload: true
           switches: -avzh --update --delete --progress
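The `Prepare ODBC installer for sync` step in `odbc-nightly` renames the MSI by inserting `dev-yyyy-mm-dd` before `win64.msi`. A minimal sketch of that `sed` substitution follows, with the date passed as an argument instead of calling `date`, so the result is reproducible. The function name `dev_msi_name` (used here as a function, whereas the workflow uses a variable of that name) and the `24.0.0` version are illustrative assumptions; the sample file name uses the hyphenated form from the upload step's `path`.

```shell
#!/bin/sh
# Hypothetical helper (not part of the workflow): applies the same sed
# expression as the step, but takes the date as $2 rather than running `date`.
#   $1: original MSI file name
#   $2: date string in yyyy-mm-dd form
dev_msi_name() {
  printf '%s\n' "$1" | sed -e "s/win64\.msi$/dev-$2-win64.msi/"
}

# Sample input; the real name comes from the downloaded artifact.
dev_msi_name "Apache-Arrow-Flight-SQL-ODBC-24.0.0-win64.msi" "2026-02-06"
```

The pattern is anchored with `$`, so a name that does not end in `win64.msi` passes through unchanged. The step's `msi_name=$(ls *.msi)` also appears to assume that the artifact contains exactly one MSI file.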
@@ -478,13 +617,48 @@ jobs:
           remote_key: ${{ secrets.NIGHTLIES_RSYNC_KEY }}
           remote_host_key: ${{ secrets.NIGHTLIES_RSYNC_HOST_KEY }}

+  odbc-release:
+    needs: odbc-msvc
+    name: ODBC release
+    runs-on: ubuntu-latest
+    if: ${{ startsWith(github.ref_name, 'apache-arrow-') && contains(github.ref_name, '-rc') }}
+    permissions:
+      # Upload to GitHub Release
+      contents: write
+    steps:
+      - name: Checkout Arrow
+        uses: actions/checkout@v6
+        with:
+          fetch-depth: 0
+          submodules: recursive
+      - name: Download the artifacts
+        uses: actions/download-artifact@v7
+        with:
+          name: flight-sql-odbc-msi-installer
+      - name: Wait for creating GitHub Release
+        env:
+          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+        run: |
+          dev/release/utils-watch-gh-workflow.sh \
+            ${GITHUB_REF_NAME} \
+            release_candidate.yml
+      - name: Upload the artifacts to GitHub Release
+        env:
+          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+        run: |
+          gh release upload ${GITHUB_REF_NAME} \
+            --clobber \
+            Apache-Arrow-Flight-SQL-ODBC-*-win64.msi
+
   report-extra-cpp:
     if: github.event_name == 'schedule' && always()
     needs:
       - docker
       - jni-linux
       - jni-macos
       - msvc-arm64
-      - odbc
+      - odbc-macos
+      - odbc-msvc
+      - odbc-nightly
     uses: ./.github/workflows/report_ci.yml
     secrets: inherit
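The `odbc-release` job above runs only when `github.ref_name` starts with `apache-arrow-` and contains `-rc`. The following is a rough shell approximation of that gate, useful for checking which ref names would qualify. The function name `is_rc_ref` and the sample ref names are assumptions for illustration. One known difference: the `startsWith` and `contains` expression functions ignore case, while shell `case` patterns are case-sensitive.

```shell
#!/bin/sh
# Hypothetical helper (not part of the workflow): approximates
#   startsWith(ref, 'apache-arrow-') && contains(ref, '-rc')
# The two conditions are tested separately, as in the original expression.
is_rc_ref() {
  case "$1" in
    apache-arrow-*) ;;      # prefix matches, continue to the second check
    *) return 1 ;;
  esac
  case "$1" in
    *-rc*) return 0 ;;      # "-rc" appears somewhere in the name
    *) return 1 ;;
  esac
}

# Sample ref names, chosen for illustration only.
for ref in apache-arrow-24.0.0-rc0 apache-arrow-24.0.0 main; do
  if is_rc_ref "$ref"; then
    echo "$ref: release job would run"
  else
    echo "$ref: release job would be skipped"
  fi
done
```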

.github/workflows/cpp_windows.yml

Lines changed: 2 additions & 0 deletions
@@ -41,12 +41,14 @@ jobs:
     runs-on: ${{ inputs.os }}
     timeout-minutes: 60
     env:
+      ARROW_AZURE: ON
       ARROW_BOOST_USE_SHARED: OFF
       ARROW_BUILD_BENCHMARKS: ON
       ARROW_BUILD_SHARED: ON
       ARROW_BUILD_STATIC: OFF
       ARROW_BUILD_TESTS: ON
       ARROW_DATASET: ON
+      ARROW_FILESYSTEM: ON
       ARROW_FLIGHT: OFF
       ARROW_HDFS: ON
       ARROW_HOME: /usr

.github/workflows/package_linux.yml

Lines changed: 1 addition & 1 deletion
@@ -218,7 +218,7 @@ jobs:
           rake version:update
           popd
       - name: Login to GitHub Container registry
-        uses: docker/login-action@5e57cd118135c172c3672efd75eb46360885c0ef # v3.6.0
+        uses: docker/login-action@c94ce9fb468520275223c153574b00df6fe4bcc9 # v3.7.0
         with:
           registry: ghcr.io
           username: ${{ github.actor }}

.github/workflows/python.yml

Lines changed: 9 additions & 4 deletions
@@ -60,6 +60,7 @@ jobs:
     timeout-minutes: 60
     strategy:
       fail-fast: false
+      max-parallel: 20
       matrix:
         name:
           - conda-python-docs
@@ -69,10 +70,10 @@ jobs:
           - conda-python-3.12-no-numpy
         include:
           - name: conda-python-docs
-            cache: conda-python-3.10
+            cache: conda-python-3.11
             image: conda-python-docs
-            title: AMD64 Conda Python 3.10 Sphinx & Numpydoc
-            python: "3.10"
+            title: AMD64 Conda Python 3.11 Sphinx & Numpydoc
+            python: "3.11"
           - name: conda-python-3.11-nopandas
             cache: conda-python-3.11
             image: conda-python
@@ -145,12 +146,15 @@ jobs:
     timeout-minutes: 60
     strategy:
       fail-fast: false
+      max-parallel: 20
       matrix:
         include:
           - architecture: AMD64
             macos-version: "15-intel"
+            large-memory-tests: "OFF"
           - architecture: ARM64
             macos-version: "14"
+            large-memory-tests: "ON"
     env:
       ARROW_HOME: /tmp/local
       ARROW_AZURE: ON
@@ -173,7 +177,8 @@ jobs:
       ARROW_WITH_SNAPPY: ON
       ARROW_WITH_BROTLI: ON
       ARROW_BUILD_TESTS: OFF
-      PYARROW_TEST_LARGE_MEMORY: ON
+      PYARROW_TEST_LARGE_MEMORY: ${{ matrix.large-memory-tests }}
+      PYTEST_ARGS: "-n auto --durations=40"
       # Current oldest supported version according to https://endoflife.date/macos
       MACOSX_DEPLOYMENT_TARGET: 12.0
     steps:
