Fix: improve speed of trees with MAE criterion from O(n^2) to O(n log n) #32100
Conversation
```cython
# MAE split precomputations algorithm
# =============================================================================

def _any_isnan_axis0(const float32_t[:, :] X):
```
I moved this one up, in the helpers section.
@adam2392 could you please have a look here?
adam2392 left a comment
First of all, thanks @cakedev0 for taking a look at this challenging but impactful issue, and for proposing a fix.
I took an initial glance. This overall looks like the right direction to me, so I want to make sure others take a look before we dive into the nitty-gritty of making the PR mergeable and maintainable.
I have an open question: for decision trees, we can imagine imposing a quantile-criterion split (e.g. the pinball loss). Naively, I think we can make the WeightedHeaps work to maintain any quantile, right?
Perhaps @thomasjpfan wants to take a look as well before we dive deeper into the code.
LGTM (besides nitpicks below)! Thanks for the great PR.
I'll let @adam2392 do the merge if he is still +1 for merging after the latest changes.
Possible follow-ups:
- generalize to regression for an arbitrary quantile;
- add support for missing values (if not overly complex).
Actually, this is very simple (it even simplifies the current code base), and has nothing to do with criteria. Criteria don't interact with feature values, just with the target values and their ordering.
I missed that PR despite the notification...
I resolved the conflict and looked at the change. I'm not super familiar with the old code, so I focused on how the new code looks. It seems fine to me. I learnt about Fenwick trees :D While the change is quite large (both in diff size and speed-up!), it does not change existing tests: it adds to an existing test and adds new tests. So I think we can be quite sure that we won't break existing users and will only deliver speed-ups.
I will attempt to look at it this week! It's in my queue that keeps queueing 😅
Let me merge to get this into 1.8, given that we already have 2 recent positive reviews. @adam2392 feel free to open a follow-up PR with incremental improvements or fixes if needed.
Thanks very much @cakedev0! This is great work!
We also need to review #32119, BTW ;)
Thank you very much, @cakedev0! 👏
Awesome. Thanks @cakedev0!
🎉 🎉 Thanks to everyone who spent time reviewing this!!
Congratulations, @cakedev0, this is very cool!!
@cakedev0 Thanks for clarifying. The whatsnew entry contains both the efficiency enhancement and the fix. That's perfect!
This PR re-implements how `DecisionTreeRegressor(criterion='absolute_error')` works under the hood, for optimization purposes. The current algorithm for calculating the AE of a split incurs an overall O(n^2) complexity for building a tree, which quickly becomes impractical. My implementation makes it O(n log n), which is tremendously faster. For instance, with d=2, n=100_000 and max_depth=1 (just one split), the execution time went from ~30s to ~100ms on my machine.
Referenced Issues
Fixes #9626 by reducing the complexity from O(n^2) to O(n log n).
Also fixes #32099 & #10725 (which are probably duplicates). But that's more of a side effect of completely re-implementing the criterion logic for MAE.
Supersedes #11649 (which was opened to fix #10725 7 years ago but never merged).
Explanation of my changes
The changes focus solely on the class `MAE(RegressionCriterion)`. The previous implementation had O(n^2) overall complexity, arising from several methods in this class:

- `update`: O(n) cost due to updating a data structure that keeps the data sorted (`WeightedMedianCalculator`/`WeightedPQueue`). Called O(n) times to find the best split => O(n^2) overall.
- `children_impurity`: O(n) due to looping over all the data points. Called O(n) times to find the best split => O(n^2) overall.

Those can't really be fixed by small local changes, as the algorithm is O(n^2) overall independently of how you implement it. Hence a complete rewrite was needed. As discussed in this technical report I made, there are several efficient algorithms to solve the problem (computing the absolute errors for all the possible splits along one feature).
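For intuition, here is a brute-force, unweighted baseline that recomputes each child's AE from scratch at every split position; O(n) work per split times O(n) splits is exactly the O(n^2) pattern described above (illustrative only, not the previous Cython code):

```python
import numpy as np

def all_split_abs_errors_bruteforce(y):
    """AE of left/right children for every split position, in O(n^2)."""
    n = len(y)
    left = np.empty(n - 1)
    right = np.empty(n - 1)
    for i in range(1, n):  # split between positions i-1 and i
        l, r = y[:i], y[i:]
        left[i - 1] = np.abs(l - np.median(l)).sum()   # O(n) per split
        right[i - 1] = np.abs(r - np.median(r)).sum()  # O(n) per split
    return left, right

left, right = all_split_abs_errors_bruteforce(np.array([1.0, 5.0, 2.0, 8.0]))
```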
The one I initially chose was an intuitive adaptation of the well-known two-heap solution to the "find median from a data stream" problem. But even though it has O(n log n) expected complexity, it can be O(n^2 log n) in some pathological cases. So, after some discussion, it was decided to implement another solution: the "Fenwick tree option". This solution is based on a Fenwick tree, a data structure specialized in efficient prefix-sum computations and updates.
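For readers unfamiliar with the data structure, here is a minimal pure-Python Fenwick (binary indexed) tree supporting O(log n) point updates and prefix sums. This is a generic sketch for illustration only; the PR's Cython version additionally tracks weighted target sums:

```python
class FenwickTree:
    """1-indexed Fenwick tree: point update and prefix sum, both O(log n)."""

    def __init__(self, n):
        self.n = n
        self.tree = [0.0] * (n + 1)

    def update(self, i, delta):
        """Add `delta` at position i (1-indexed)."""
        while i <= self.n:
            self.tree[i] += delta
            i += i & (-i)  # jump to the next node whose range covers i

    def prefix_sum(self, i):
        """Return the sum of positions 1..i."""
        s = 0.0
        while i > 0:
            s += self.tree[i]
            i -= i & (-i)  # strip the lowest set bit
        return s

ft = FenwickTree(8)
ft.update(3, 2.0)
ft.update(5, 1.5)
print(ft.prefix_sum(4))  # sums positions 1..4, so only the 2.0 at position 3
```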
See the technical report for a detailed explanation of the algorithm, but in short, the main steps are:

- Iterate on the data from left to right to compute the AE for every possible left child, and iterate from right to left to compute the AE for every possible right child.
- For each candidate child, the AE around its weighted median m decomposes into 4 prefix/suffix sums (total weight and weighted target sum, below and above m): `AE = (m * W_below - S_below) + (S_above - m * W_above)`.
- The value of those 4 prefix/suffix sums can be found while searching for the median in the tree, and once you have those, the computation becomes O(1).
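To make the prefix/suffix-sum idea concrete, here is an unweighted sketch cross-checked against the direct O(n) computation: once the median m and the four sums (counts and target sums below/above m) are known, the child's AE is O(1). The helper name is hypothetical, not the PR's code:

```python
import numpy as np

def abs_error_from_sums(m, n_below, s_below, n_above, s_above):
    """AE around median m from 4 precomputed sums: O(1) once sums are known."""
    # points below m each contribute (m - y); points above contribute (y - m)
    return (m * n_below - s_below) + (s_above - m * n_above)

rng = np.random.default_rng(0)
y = rng.standard_normal(101)
m = float(np.median(y))  # odd n, so m is an actual sample contributing 0
below, above = y[y < m], y[y > m]
ae = abs_error_from_sums(m, below.size, below.sum(), above.size, above.sum())
```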
This logic is implemented in `tree/_criterion.pyx::precompute_absolute_errors`, as I wanted to be able to unit test it. After some research, I found a paper about the same problem. Their approach uses the two-heap idea and generalizes to arbitrary quantiles (as done in my follow-up PR), but it does not handle weighted samples. Also, the paper uses a more elaborate formula for the absolute error/loss computation than mine; TBH it looks unnecessarily complex.