
[pull] master from GaijinEntertainment:master #1007

Merged
pull[bot] merged 14 commits into forksnd:master from GaijinEntertainment:master
May 18, 2026


@pull pull Bot commented May 18, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)


borisbat and others added 14 commits May 17, 2026 21:07
Closes most of the gap to libc++ std::sort and beats libstdc++
std::sort in 19/20 cells. Header moves to include/daScript/simulate/
so aot.h can wire the typed das_sort<T> into the workhorse binding
path. Bench gets a parallel libstdc++ build target so we can A/B
both stdlibs from one tree.

Algorithm changes in include/daScript/simulate/das_qsort_r.h
(promoted from src/builtin/):

- size_t indices throughout (>2G element support)
- byte_swap sized dispatch for w ∈ {4,8,16,32,64,128,256}, chunked
  memcpy fallback for the rest. 30-140× faster than the generic
  loop at common widths (micro-bench in examples/sort/bench_byte_swap.cpp)
- New das_block_partition_r: byte-pointer port of libc++
  __bitset_partition. Populate a uint64_t mask of comparison
  outcomes for 64 elements branchlessly, then drive swaps with
  countr_zero. Cuts mispredictions from ~32/partition to ~1/64
  on random data
- das_qsort_r is now hybrid: block partition for hi-lo ≥ 128, Hoare
  for smaller ranges. Median-of-3 pivot placed at data[lo] for both
  paths
- New das_sort<T, Compare> + das_sort_block<T, Compare>: typed
  mirrors of the byte-pointer impls. Same algorithm shape using
  std::swap and typed indexing. Provides the apples-to-apples peer
  for std::sort and the daslang typed-binding entry point
- sized_memcpy helper for hole-sliding sift_down (das_sift_down_r)
  inner loop. Per-level memcpy at known struct widths lowers to a
  single SIMD load/store pair
- das_heapsort_helper_r / das_make_heap_r / das_push_heap_r /
  das_pop_heap_r unchanged (Phase 0 winners — hole-sliding sift
  was already the bake-off champion for those)

Daslang binding (include/daScript/simulate/aot.h): the 10 typed-sort
call sites in scblk / scblk_array / builtin_sort_cblock switch
from unqualified sort() (== std::sort via using namespace std) to
das_sort. Linux/libstdc++ users gain ~1.5× on typed sorts; Mac/libc++
users see no regression because compile-time constant propagation
through sizeof(T) already specializes our template to match libc++
performance on workhorse types and beats it on struct types.

Bench infrastructure (examples/sort/):
- bench_sort_family.cpp: 5-arm sort deep-dive table (std::sort,
  C qsort, das_qsort_r, das_qsort_block_r, das_sort<T>,
  das_sort_block<T>), correctness verification on every candidate,
  stdlib + compiler print
- bench_byte_swap.cpp: new standalone micro-bench for the
  byte_swap primitive (chunked256, chunked64, words64-kernel-style,
  sized-dispatch, hybrid)
- CMakeLists.txt: optional parallel libstdc++ build target (gated
  on g++-N availability) so a single configure produces both
  libc++ and libstdc++ binaries
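
For readers unfamiliar with the "sized-dispatch" arm that bench_byte_swap.cpp measures, the following is a minimal sketch of the technique, assuming the widths listed in the algorithm notes (4 through 256 bytes) and a 64-byte chunk for the fallback. The names swap_fixed and byte_swap_sketch are hypothetical, and the actual header may differ in chunk size and structure.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Swap exactly W bytes. With W a compile-time constant, compilers typically
// lower these fixed-size memcpy calls to plain (often vector) loads and stores.
template <std::size_t W>
inline void swap_fixed(unsigned char* a, unsigned char* b) {
    unsigned char tmp[W];
    std::memcpy(tmp, a, W);
    std::memcpy(a, b, W);
    std::memcpy(b, tmp, W);
}

// Hypothetical dispatcher, not the byte_swap in das_qsort_r.h.
// The two ranges must not overlap.
inline void byte_swap_sketch(void* pa, void* pb, std::size_t w) {
    assert(pa != nullptr && pb != nullptr);
    auto* a = static_cast<unsigned char*>(pa);
    auto* b = static_cast<unsigned char*>(pb);
    switch (w) {
        case 4:   swap_fixed<4>(a, b);   return;
        case 8:   swap_fixed<8>(a, b);   return;
        case 16:  swap_fixed<16>(a, b);  return;
        case 32:  swap_fixed<32>(a, b);  return;
        case 64:  swap_fixed<64>(a, b);  return;
        case 128: swap_fixed<128>(a, b); return;
        case 256: swap_fixed<256>(a, b); return;
        default:  break;
    }
    // Fallback for other widths: swap in 64-byte chunks, then the remainder.
    unsigned char tmp[64];
    while (w > 0) {
        const std::size_t n = w < sizeof(tmp) ? w : sizeof(tmp);
        std::memcpy(tmp, a, n);
        std::memcpy(a, b, n);
        std::memcpy(b, tmp, n);
        a += n;
        b += n;
        w -= n;
    }
}
```

A qsort-style caller would invoke byte_swap_sketch(base + i * width, base + j * width, width) for distinct i and j; the switch turns the runtime width into a compile-time constant for the common cases.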

Phase 0.1 bake-off scaffolding (the candidates from Phase 0.1 —
introsort, pdqsort-lite Hoare variant, Lomuto introselect, Floyd
two-phase sift, ternary qsort, etc.) is not retained in the final
header. Final state is: byte-pointer block-partition pdqsort
hybrid, typed mirror, byte-pointer hole-sliding heap ops, byte-pointer
heap-of-N partial_sort, byte-pointer Hoare-introselect nth_element.

Headline benchmarks at N=100K (M-series Mac):

vs libc++ std::* (pdqsort + block-partition):
- 9/20 wins, including all of nth_element (0.64-0.74×),
  sort/struct types (0.61-0.91×), make_heap/int32 (0.95×)
- Losses: sort/workhorse (1.37-1.38×), heap_sort/big structs
  (1.12×)

vs libstdc++ std::* (Musser introsort):
- 19/20 wins. Only heap_sort/P128 ties (1.01×). Across the board
  we beat libstdc++ 1.1-1.8× — Musser introsort hasn't been updated
  to pdqsort upstream

Daslang runtime:
- sort_struct_by_key/100K cblock path: 281 → 255 ns/op (9% faster)
- m3_topn_array/100K (top_n_by) = 38 ns/op, matches SQLite's
  ORDER BY ... LIMIT 10 at 37 ns/op (LINQ-vs-SQL parity restored)

Verification: ctest 29/29; tests/linq/test_linq_sorting.das 59/59;
full dastest 8378 tests (8372 pass, 6 skipped, 0 failures — identical
to master baseline).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Cache the questions that took >5 min of in-session research so future
sessions can answer them in 1 ask:

PR #2707 (sort family bake-off) findings:
- byte-swap-micro-win-invisible-under-cblock-dominance
- das-qsort-r-vs-std-perf-comparison
- libcxx-stdsort-block-partition-pdqsort
- qsort-byte-swap-implementations-survey
- standalone-example-no-daslang-link
- what-daslib-operations-exist-for-partial-sort-nth-element-heap-ops-and-top-n-selection
- what-s-the-right-anti-dce-pattern-for-a-c-microbenchmark-inner-loop-so-the-optimizer-can-t-elide-it
- where-are-the-cross-compiler-bit-scan-and-popcount-helpers-in-daslang-s-c-headers

Doc-CI iteration findings:
- sphinx-w-fails-on-my-pr-branch-with-undefined-label-struct-module-x-but-master-ci-is-green-...
- what-ci-checks-must-pass-when-i-regenerate-doc-source-stdlib-via-das2rst-das

Site-deploy gotcha:
- why-does-a-new-top-level-html-page-e-g-daspkg-html-added-under-site-404-on-daslang-io-after-merging-to-master

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…-bakeoff

Sort family: block-partition pdqsort + typed das_sort<T>

Override sphinx_rtd_theme's sidebartitle block so the orange `> daslang.io`
logo links to https://daslang.io instead of pathto(_root_doc) (which is a
self-link on the docs index). Mirrors upstream block at
sphinx_rtd_theme/layout.html with the <a href> swapped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ven-sort

site/blog: add "Do you even sort?" post

…-2026-05-18

mouse-data: 11 new cards from sort-family + doc-CI sessions

…skuriakova-afdc9d

doc: clickable daslang.io banner in sphinx sidebar

PR #2707 (sort-family) swapped daslang's qsort to block-partition pdqsort.
That changed the tie-break order among equal sort keys, which broke
das2rst-driven module docs in a non-obvious way:

  daslib/rst.das:1882 sorts `grp.func` by function_name only —
  `$(a,b) => function_name(a.fn) < function_name(b.fn)`. For overloaded
  functions the comparator returns false both ways (equal key), so which
  overload comes out "first" depends on qsort's internal tie-break.

Downstream, the loop at lines 1912-1929 stamps `is_overload = (cur_name
== prev_func_name)`. The first overload in iteration order gets
`is_overload=false` → full :Arguments: emission with :ref: to each
param type. Different overloads use different param types, so the
choice of "first" decides which :ref: targets the page references.

Symptom: dasImgui CI's sphinx-build -W failed with
`undefined label: 'alias-imvec4'` in
doc/source/stdlib/generated/imgui_style_builtin.rst — after #2707,
push_style_one(ImGuiCol; ImVec4) now wins the detailed slot, and
daslib/rst.das describe_type() emits a :ref:`ImVec4 <alias-imvec4>`
for any TypeDecl whose `td.alias` is non-empty (set by the C++ binding
`t->alias = "ImVec4"`). The alias label is never defined — `:ref:` to
nowhere — sphinx-build -W exits non-zero.

This was a latent bug: rst.das relied on unstable sort tie-breaking
(see daslang qsort-is-not-stable lore). #2707 just exposed it.

Fix: sort by the full signature string (rst_describe_function_short)
instead of just the function name. The string starts with the function
name, so name-alphabetical primary order is preserved, and overloads
sort deterministically by signature within each name-run.
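
The effect of the fix can be illustrated outside daslang with a small C++ sketch. It assumes a hypothetical FuncEntry record whose signature string begins with the function name (standing in for what rst_describe_function_short returns), and made-up function names; first_of_each_name mimics the is_overload stamping described above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one documented function.
struct FuncEntry {
    std::string name;       // e.g. "scale"
    std::string signature;  // e.g. "scale(int)"; begins with `name`
};

// Sort by the full signature. Distinct overloads never compare equal, so the
// result does not depend on how an unstable sort breaks ties.
inline void sort_by_signature(std::vector<FuncEntry>& fs) {
    for (const FuncEntry& e : fs) {
        // Precondition of this sketch: the signature starts with the name.
        assert(e.signature.compare(0, e.name.size(), e.name) == 0);
    }
    std::sort(fs.begin(), fs.end(), [](const FuncEntry& a, const FuncEntry& b) {
        return a.signature < b.signature;
    });
}

// Signature of the first entry in each run of equal names, i.e. the overload
// that would receive the detailed (is_overload == false) treatment.
inline std::vector<std::string> first_of_each_name(const std::vector<FuncEntry>& fs) {
    std::vector<std::string> out;
    for (std::size_t i = 0; i < fs.size(); ++i) {
        const bool is_overload = i > 0 && fs[i].name == fs[i - 1].name;
        if (!is_overload) out.push_back(fs[i].signature);
    }
    return out;
}
```

With a name-only comparator, the two scale overloads would compare equal and either could land first; with the signature comparator, any input order yields the same "first" overload.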

Regenerated 77 doc/source/stdlib/handmade/function-*.rst entries —
the new "first detailed" overload per name-run across math, builtin,
ast, ast_boost, raster, strings, pugixml, debugapi, dashv, rtti,
strings_boost, uriparser. Each stub filled by copying the closest
signature-matched sibling's description; math overloads hand-checked
for vector/scalar semantic drift (mad fusion claim dropped, round
nearest-even claim qualified, identity 3x3 wording trimmed).

Sphinx -W --keep-going -b html builds clean (0 warnings).
das2rst.das re-run is idempotent (no new stubs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…break

daslib/rst: deterministic overload sort tiebreak (unblocks dasImgui doc CI)

…ere splice

Restructures _fold(chain) into a three-tier cascade:
  1. splice    — fused for-loop, lambdas inlined (hot patterns)
  2. fallback  — fold_linq_default: array-shape pipeline with _inplace reuse + delete
  3. raw       — clone_expression passthrough

All tiers preserve semantics; _fold(chain) is observationally equivalent to
chain. Obviates the previously-planned Phase 2D fail-loudly contract.

Phase 1 retirement:
- _old_fold macro deleted (everywhere: macro, helpers, header refs, bench files)
- g_foldSeq dispatch table + 7 FoldSequence patterns deleted (fold_where_count,
  fold_where_select, fold_select_where, fold_where, fold_select,
  fold_order_distinct x2) — splice arms cover every shape they recognized
- recursiveMacroName param dropped from fold_linq_default; hardcoded to "_fold"
- where__to_array double-underscore rename bug fixed (callName ends_with "_")

Phase 3 new splice arms (plan_order_family):
- bare arr |> order[_by]?[_descending]?  → direct call (drops iterator wrapper)
- src |> order[_by]?[_descending]?  |> take(K)  → top_n[_by][_descending]
- src |> where_*(p)+ |> order*(key?)  → fused prefilter buffer + sort_inplace
- src |> where_*(p)+ |> order*(key?) |> take(K)  → fused prefilter + top_n*

Phase 3d first select+where splice (was blocked since Phase 2A):
- daslib/templates_boost.das: new replaceVariablePeeling helper that peels the
  typer-inserted ExprRef2Value wrapper during substitution into typed AST
  (mirrors qm_peel_ref2value in daslib/ast_match)
- daslib/linq_fold.das: fold_linq_cond_peel uses the new helper to splice
  select(proj) |> where(pred) into a fused predicate, bailing to tier 2 when
  has_sideeffects(proj) to avoid double-evaluation. All four terminator lanes
  covered: array / counter / accumulator / early-exit.
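
Why the side-effect check matters can be shown with a C++ analogy. The sketch below is an assumption about the general hazard of substitution-based fusion, not a description of what fold_linq_cond_peel generates: if the projection expression is substituted both into the predicate and into the emitted value, it runs twice for every element that passes. The function names are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Reference pipeline: materialize select(proj), then apply where(pred).
// proj runs exactly once per source element.
template <class Proj, class Pred>
std::vector<int> select_where_two_pass(const std::vector<int>& src, Proj proj, Pred pred) {
    std::vector<int> projected;
    projected.reserve(src.size());
    for (int x : src) projected.push_back(proj(x));
    assert(projected.size() == src.size());

    std::vector<int> out;
    for (int y : projected) {
        if (pred(y)) out.push_back(y);
    }
    return out;
}

// Fused single loop with proj substituted at both use sites. There is no
// intermediate array, but proj runs a second time for each element kept.
template <class Proj, class Pred>
std::vector<int> select_where_fused(const std::vector<int>& src, Proj proj, Pred pred) {
    std::vector<int> out;
    for (int x : src) {
        if (pred(proj(x))) out.push_back(proj(x));
    }
    return out;
}
```

For a pure projection the two functions return the same array, so fusing is safe. If proj mutates state (a counter, a log, an RNG), the fused form observes extra calls, which is the kind of divergence that falling back to the unfused tier avoids.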

Phase 2 library additions:
- daslib/linq.das: top_n_by_descending and top_n_descending (array + iterator
  source variants each) — mirror top_n_by / top_n with flipped comparator
  (partial_sort + reversed less for array; bounded min-heap for iterator)
- linqCalls dict registers top_n / top_n_by / top_n_descending /
  top_n_by_descending so flatten_linq recognizes them
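
The two strategies named above (partial_sort with a reversed less for arrays, a bounded min-heap for iterators) look roughly like this in standard C++. This is a sketch over int with hypothetical function names; the daslang versions in daslib/linq.das are generic and are not reproduced here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Array source: partial_sort with a reversed comparator leaves the n largest
// values, in descending order, at the front.
inline std::vector<int> top_n_desc_array(std::vector<int> v, std::size_t n) {
    n = std::min(n, v.size());
    const auto mid = v.begin() + static_cast<std::ptrdiff_t>(n);
    std::partial_sort(v.begin(), mid, v.end(), std::greater<int>());
    v.resize(n);
    return v;
}

// Streaming source: keep at most n values in a min-heap, so the root is the
// weakest current candidate and memory stays O(n).
template <class It>
std::vector<int> top_n_desc_stream(It first, It last, std::size_t n) {
    std::priority_queue<int, std::vector<int>, std::greater<int>> heap;
    for (; n > 0 && first != last; ++first) {
        if (heap.size() < n) {
            heap.push(*first);
        } else if (*first > heap.top()) {
            heap.pop();          // evict the weakest candidate
            heap.push(*first);
        }
    }
    assert(heap.size() <= n);
    std::vector<int> out(heap.size());
    for (std::size_t i = out.size(); i-- > 0;) {
        out[i] = heap.top();     // smallest survivor goes last
        heap.pop();
    }
    return out;
}
```

Both handle the edge cases the new subtests mention: n == 0 yields an empty result, and n larger than the input yields the whole input in descending order.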

Concurrent runtime fix:
- src/builtin/module_builtin_runtime_sort.cpp:84  builtin_sort_string switched
  from unqualified sort() (= std::sort via using namespace std) to das_sort
  (block-partition pdqsort from PR #2707). This is the runtime path that
  order_by<string> takes; Linux/libstdc++ users see the same ~1.5x speedup
  that PR #2707 brought to typed sorts.

Benchmarks (100K rows, INTERP, m3 vs m3f, smaller better):
- order_take_desc:        m3 698 → m3f 56 ns/op  (12.5x — new top_n_by_descending)
- sort_take:              m3 713 → m3f 56 ns/op  (12.7x — top_n_by via splice)
- select_where_order_take: m3 354 → m3f 39 ns/op  (9.1x  — fused prefilter+top_n_by)
- select_where_count:     m3  57 → m3f  5 ns/op  (11.4x — Phase 3d peel)
- chained_where:          m3  45 → m3f  6 ns/op  (7.5x)
- bare_order_where:       m3 357 → m3f 340 ns/op (1.05x — sort dominates)

Three new bench files (bare_order_where, order_take_desc, select_where_count)
+ m3f_old column dropped from all 29 existing files + 2 new top_n test funcs
(13 subtests across array+iterator sources, including N=1, N=0, N>length,
struct types, parity vs hand-rolled reference) + new plan_order_family + Phase
3d AST shape tests in test_linq_fold_ast.das.

Tests: 8393/8393 dastest; 7782 AOT, all pass. Sphinx -W clean. detect-dupe
clean (siblings-by-design only). Modeled on PR #2707 (single squashed commit,
multi-area bundle, headline numbers in PR body).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Previously `daspkg install` / `update` / `build` invoked `cmake --build`
without `--parallel`, so on generators whose default is single-job
(MSBuild on Windows, Make on Linux/macOS) the build ran serially.
Adding `--parallel` lets CMake pick a sensible per-generator default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ort-family

linq_fold: retire _old_fold; 3-tier cascade; order-family + select+where splice

…llel-build

daspkg: parallelize cmake build in build_package

@pull pull Bot locked and limited conversation to collaborators May 18, 2026
@pull pull Bot added the ⤵️ pull label May 18, 2026
@pull pull Bot merged commit f2b24fb into forksnd:master May 18, 2026
@pull pull Bot had a problem deploying to github-pages May 18, 2026 08:58 Error