[pull] master from GaijinEntertainment:master #1007
Merged
Closes most of the gap to libc++ std::sort and beats libstdc++
std::sort 19/20 cells. Header moves to include/daScript/simulate/
so aot.h can wire the typed das_sort<T> into the workhorse binding
path. Bench gets a parallel libstdc++ build target so we can A/B
both stdlibs from one tree.
Algorithm changes in include/daScript/simulate/das_qsort_r.h
(promoted from src/builtin/):
- size_t indices throughout (>2G element support)
- byte_swap sized dispatch for w ∈ {4,8,16,32,64,128,256}, chunked
memcpy fallback for the rest. 30-140× faster than the generic
loop at common widths (micro-bench in examples/sort/bench_byte_swap.cpp)
- New das_block_partition_r: byte-pointer port of libc++
__bitset_partition. Populate a uint64_t mask of comparison
outcomes for 64 elements branchlessly, then drive swaps with
countr_zero. Cuts mispredictions from ~32/partition to ~1/64
on random data
- das_qsort_r is now hybrid: block partition for hi-lo ≥ 128, Hoare
for smaller ranges. Median-of-3 pivot placed at data[lo] for both
paths
- New das_sort<T, Compare> + das_sort_block<T, Compare>: typed
mirrors of the byte-pointer impls. Same algorithm shape using
std::swap and typed indexing. Provides the apples-to-apples peer
for std::sort and the daslang typed-binding entry point
- sized_memcpy helper for hole-sliding sift_down (das_sift_down_r)
inner loop. Per-level memcpy at known struct widths lowers to a
single SIMD load/store pair
- das_heapsort_helper_r / das_make_heap_r / das_push_heap_r /
das_pop_heap_r unchanged (Phase 0 winners — hole-sliding sift
was already the bake-off champion for those)
Daslang binding (include/daScript/simulate/aot.h): the 10 typed-sort
call sites in scblk / scblk_array / builtin_sort_cblock switch
from unqualified sort() (== std::sort via using namespace std) to
das_sort. Linux/libstdc++ users gain ~1.5× on typed sorts; Mac/libc++
users see no regression because compile-time constant propagation
through sizeof(T) already specializes our template to match libc++
performance on workhorse types and beats it on struct types.
Bench infrastructure (examples/sort/):
- bench_sort_family.cpp: 5-arm sort deep-dive table (std::sort,
C qsort, das_qsort_r, das_qsort_block_r, das_sort<T>,
das_sort_block<T>), correctness verification on every candidate,
stdlib + compiler print
- bench_byte_swap.cpp: new standalone micro-bench for the
byte_swap primitive (chunked256, chunked64, words64-kernel-style,
sized-dispatch, hybrid)
- CMakeLists.txt: optional parallel libstdc++ build target (gated
on g++-N availability) so a single configure produces both
libc++ and libstdc++ binaries
Phase 0.1 bake-off scaffolding (the candidates from Phase 0.1 —
introsort, pdqsort-lite Hoare variant, Lomuto introselect, Floyd
two-phase sift, ternary qsort, etc.) is not retained in the final
header. Final state is: byte-pointer block-partition pdqsort
hybrid, typed mirror, byte-pointer hole-sliding heap ops, byte-pointer
heap-of-N partial_sort, byte-pointer Hoare-introselect nth_element.
Headline benchmarks at N=100K (M-series Mac):
vs libc++ std::* (pdqsort + block-partition):
- 9/20 wins, including all of nth_element (0.64-0.74×),
sort/struct types (0.61-0.91×), make_heap/int32 (0.95×)
- Losses: sort/workhorse (1.37-1.38×), heap_sort/big structs
(1.12×)
vs libstdc++ std::* (Musser introsort):
- 19/20 wins. Only heap_sort/P128 ties (1.01×). Across the board
we beat libstdc++ 1.1-1.8× — Musser introsort hasn't been updated
to pdqsort upstream
Daslang runtime:
- sort_struct_by_key/100K cblock path: 281 → 255 ns/op (9% faster)
- m3_topn_array/100K (top_n_by) = 38 ns/op, matches SQLite's
ORDER BY ... LIMIT 10 at 37 ns/op (LINQ-vs-SQL parity restored)
Verification: ctest 29/29; tests/linq/test_linq_sorting.das 59/59;
full dastest 8378 tests (8372 pass, 6 skipped, 0 failures — identical
to master baseline).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cache the questions that took >5 min of in-session research so future sessions can answer them in 1 ask:
PR #2707 (sort family bake-off) findings:
- byte-swap-micro-win-invisible-under-cblock-dominance
- das-qsort-r-vs-std-perf-comparison
- libcxx-stdsort-block-partition-pdqsort
- qsort-byte-swap-implementations-survey
- standalone-example-no-daslang-link
- what-daslib-operations-exist-for-partial-sort-nth-element-heap-ops-and-top-n-selection
- what-s-the-right-anti-dce-pattern-for-a-c-microbenchmark-inner-loop-so-the-optimizer-can-t-elide-it
- where-are-the-cross-compiler-bit-scan-and-popcount-helpers-in-daslang-s-c-headers
Doc-CI iteration findings:
- sphinx-w-fails-on-my-pr-branch-with-undefined-label-struct-module-x-but-master-ci-is-green-...
- what-ci-checks-must-pass-when-i-regenerate-doc-source-stdlib-via-das2rst-das
Site-deploy gotcha:
- why-does-a-new-top-level-html-page-e-g-daspkg-html-added-under-site-404-on-daslang-io-after-merging-to-master
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-bakeoff Sort family: block-partition pdqsort + typed das_sort<T>
Override sphinx_rtd_theme's sidebartitle block so the orange `> daslang.io` logo links to https://daslang.io instead of pathto(_root_doc) (which is a self-link on the docs index). Mirrors upstream block at sphinx_rtd_theme/layout.html with the <a href> swapped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ven-sort site/blog: add "Do you even sort?" post
…-2026-05-18 mouse-data: 11 new cards from sort-family + doc-CI sessions
…skuriakova-afdc9d doc: clickable daslang.io banner in sphinx sidebar
PR #2707 (sort-family) swapped daslang's qsort to block-partition pdqsort. That changed the tie-break order among equal sort keys, which broke das2rst-driven module docs in a non-obvious way:

daslib/rst.das:1882 sorts `grp.func` by function_name only: `$(a,b) => function_name(a.fn) < function_name(b.fn)`. For overloaded functions the comparator returns false both ways (equal key), so which overload comes out "first" depends on qsort's internal tie-break.

Downstream, the loop at lines 1912-1929 stamps `is_overload = (cur_name == prev_func_name)`. The first overload in iteration order gets `is_overload=false` → full :Arguments: emission with :ref: to each param type. Different overloads use different param types, so the choice of "first" decides which :ref: targets the page references.

Symptom: dasImgui CI's sphinx-build -W failed with `undefined label: 'alias-imvec4'` in doc/source/stdlib/generated/imgui_style_builtin.rst. After #2707, push_style_one(ImGuiCol; ImVec4) now wins the detailed slot, and daslib/rst.das describe_type() emits a :ref:`ImVec4 <alias-imvec4>` for any TypeDecl whose `td.alias` is non-empty (set by the C++ binding `t->alias = "ImVec4"`). The alias label is never defined, so the `:ref:` points nowhere and sphinx-build -W exits non-zero.

This was a latent bug: rst.das relied on unstable sort tie-breaking (see daslang qsort-is-not-stable lore). #2707 just exposed it.

Fix: sort by the full signature string (rst_describe_function_short) instead of just the function name. The string starts with the function name, so name-alphabetical primary order is preserved, and overloads sort deterministically by signature within each name-run.

Regenerated 77 doc/source/stdlib/handmade/function-*.rst entries: the new "first detailed" overload per name-run across math, builtin, ast, ast_boost, raster, strings, pugixml, debugapi, dashv, rtti, strings_boost, uriparser. Each stub filled by copying the closest signature-matched sibling's description; math overloads hand-checked for vector/scalar semantic drift (mad fusion claim dropped, round nearest-even claim qualified, identity 3x3 wording trimmed).

Sphinx -W --keep-going -b html builds clean (0 warnings). das2rst.das re-run is idempotent (no new stubs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
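
The tie-break problem and its fix can be illustrated with a small C++ sketch. It is not the daslang code in daslib/rst.das: the `Func` struct, the `first_overloads` helper and the `scale`/`mad` signatures are hypothetical, and the `name(args)` signature format is an assumption made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a documented function entry.
struct Func {
    std::string name;       // e.g. "scale"
    std::string signature;  // e.g. "scale(float)"; assumed to start with the name
};

// Returns the signature that gets the "detailed" slot for each name-run,
// i.e. the first overload encountered after sorting.
std::vector<std::string> first_overloads(std::vector<Func> funcs) {
    // Sorting by name alone would leave overloads as equal keys, and an
    // unstable sort may order equal keys differently from one algorithm to
    // the next. Sorting by the full signature gives a total order whenever
    // signatures are unique, so the result no longer depends on the sort.
    std::sort(funcs.begin(), funcs.end(),
              [](const Func& a, const Func& b) { return a.signature < b.signature; });

    std::vector<std::string> detailed;
    std::string prev_name;
    bool have_prev = false;
    for (const Func& f : funcs) {
        bool is_overload = have_prev && f.name == prev_name;
        if (!is_overload) detailed.push_back(f.signature);
        prev_name = f.name;
        have_prev = true;
    }
    return detailed;
}
```

With this comparator the same set of functions yields the same "first" overload regardless of input order or of which unstable sort runs underneath.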
…break daslib/rst: deterministic overload sort tiebreak (unblocks dasImgui doc CI)
…ere splice

Restructures _fold(chain) into a three-tier cascade:
1. splice: fused for-loop, lambdas inlined (hot patterns)
2. fallback: fold_linq_default, an array-shape pipeline with _inplace reuse + delete
3. raw: clone_expression passthrough

All tiers preserve semantics; _fold(chain) is observationally equivalent to chain. Obviates the previously-planned Phase 2D fail-loudly contract.

Phase 1 retirement:
- _old_fold macro deleted (everywhere: macro, helpers, header refs, bench files)
- g_foldSeq dispatch table + 7 FoldSequence patterns deleted (fold_where_count, fold_where_select, fold_select_where, fold_where, fold_select, fold_order_distinct x2); splice arms cover every shape they recognized
- recursiveMacroName param dropped from fold_linq_default; hardcoded to "_fold"
- where__to_array double-underscore rename bug fixed (callName ends_with "_")

Phase 3 new splice arms (plan_order_family):
- bare arr |> order[_by]?[_descending]? → direct call (drops iterator wrapper)
- src |> order[_by]?[_descending]? |> take(K) → top_n[_by][_descending]
- src |> where_*(p)+ |> order*(key?) → fused prefilter buffer + sort_inplace
- src |> where_*(p)+ |> order*(key?) |> take(K) → fused prefilter + top_n*

Phase 3d first select+where splice (was blocked since Phase 2A):
- daslib/templates_boost.das: new replaceVariablePeeling helper that peels the typer-inserted ExprRef2Value wrapper during substitution into typed AST (mirrors qm_peel_ref2value in daslib/ast_match)
- daslib/linq_fold.das: fold_linq_cond_peel uses the new helper to splice select(proj) |> where(pred) into a fused predicate, bailing to tier 2 when has_sideeffects(proj) to avoid double-evaluation. All four terminator lanes covered: array / counter / accumulator / early-exit.

Phase 2 library additions:
- daslib/linq.das: top_n_by_descending and top_n_descending (array + iterator source variants each) mirror top_n_by / top_n with flipped comparator (partial_sort + reversed less for array; bounded min-heap for iterator)
- linqCalls dict registers top_n / top_n_by / top_n_descending / top_n_by_descending so flatten_linq recognizes them

Concurrent runtime fix:
- src/builtin/module_builtin_runtime_sort.cpp:84 builtin_sort_string switched from unqualified sort() (= std::sort via using namespace std) to das_sort (block-partition pdqsort from PR #2707). This is the runtime path order_by<string> takes; on Linux/libstdc++ users see the same ~1.5x speedup PR #2707 brought to typed sorts.

Benchmarks (100K rows, INTERP, m3 vs m3f, smaller better):
- order_take_desc: m3 698 → m3f 56 ns/op (12.5x, new top_n_by_descending)
- sort_take: m3 713 → m3f 56 ns/op (12.7x, top_n_by via splice)
- select_where_order_take: m3 354 → m3f 39 ns/op (9.1x, fused prefilter+top_n_by)
- select_where_count: m3 57 → m3f 5 ns/op (11.4x, Phase 3d peel)
- chained_where: m3 45 → m3f 6 ns/op (7.5x)
- bare_order_where: m3 357 → m3f 340 ns/op (1.05x, sort dominates)

Three new bench files (bare_order_where, order_take_desc, select_where_count) + m3f_old column dropped from all 29 existing files + 2 new top_n test funcs (13 subtests across array+iterator sources, including N=1, N=0, N>length, struct types, parity vs hand-rolled reference) + new plan_order_family + Phase 3d AST shape tests in test_linq_fold_ast.das.

Tests: 8393/8393 dastest; 7782 AOT, all pass. Sphinx -W clean. detect-dupe clean (siblings-by-design only).

Modeled on PR #2707 (single squashed commit, multi-area bundle, headline numbers in PR body).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
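
The two top-n strategies named in the library additions (partial_sort with a reversed comparison for arrays, a bounded min-heap for iterators) can be sketched in C++ as below. This is an illustration on `int` only, not the daslib/linq.das implementation; `top_n_desc_array` and `top_n_desc_stream` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Array source: partial_sort with a reversed comparison moves the n largest
// values to the front in descending order, leaving the rest unsorted.
std::vector<int> top_n_desc_array(std::vector<int> v, std::size_t n) {
    n = std::min(n, v.size());
    std::partial_sort(v.begin(), v.begin() + static_cast<std::ptrdiff_t>(n), v.end(),
                      std::greater<int>());
    v.resize(n);
    return v;
}

// Streaming source: keep at most n values in a min-heap. The heap top is the
// smallest value kept so far, so a new value only enters if it beats the top.
template <class It>
std::vector<int> top_n_desc_stream(It first, It last, std::size_t n) {
    if (n == 0) return {};
    std::priority_queue<int, std::vector<int>, std::greater<int>> heap;
    for (; first != last; ++first) {
        if (heap.size() < n) {
            heap.push(*first);
        } else if (*first > heap.top()) {
            heap.pop();
            heap.push(*first);
        }
    }
    // Popping yields ascending order, so fill the output from the back.
    std::vector<int> out(heap.size());
    for (std::size_t i = out.size(); i-- > 0;) {
        out[i] = heap.top();
        heap.pop();
    }
    return out;
}
```

The streaming variant never holds more than n values, which is why it suits iterator sources whose length is not known up front.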
Previously `daspkg install` / `update` / `build` invoked `cmake --build` without `--parallel`, so on generators whose default is single-job (MSBuild on Windows, Make on Linux/macOS) the build ran serially. Adding `--parallel` lets CMake pick a sensible per-generator default. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ort-family linq_fold: retire _old_fold; 3-tier cascade; order-family + select+where splice
…llel-build daspkg: parallelize cmake build in build_package
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)