Tests related to ClassificationNeuralNetwork very slow on Debian x86-64 CI runners #168

Closed
svillemot opened this issue Jan 13, 2025 · 18 comments

Comments

@svillemot
Member

The tests related to ClassificationNeuralNetwork are extremely slow compared to the others on Debian x86-64 CI runners.

See for example those logs which have timings prepended to each log line:
https://ci.debian.net/packages/o/octave-statistics/unstable/amd64/56397827/
https://ci.debian.net/packages/o/octave-statistics/unstable/amd64/55938923/

This leads to timeouts on the CI runners.

Curiously, the problem does not manifest on other processor architectures (not even on x86-32).

Do you have any idea of what might be causing the problem?

@pr0m1th3as
Member

pr0m1th3as commented Jan 13, 2025

The classification classdefs each contain several tests, and some of them take a reasonable amount of time because, apart from the usual input validation and error-checking BISTs, they include a number of tests that actually train models.
On my machine (Intel® Core™ i7-10710U CPU @ 1.10GHz × 12 with Ubuntu 20.04 LTS) they take the following time:

../statistics-1.7.0/Classification/ClassificationDiscriminant.m  pass   65/65   [ 1.699s/  1.669s]
../packages/statistics-1.7.0/Classification/ClassificationGAM.m  pass   34/34   [32.343s / 31.469s]
../packages/statistics-1.7.0/Classification/ClassificationKNN.m  pass  162/162  [ 3.726s /  3.846s]
..statistics-1.7.0/Classification/ClassificationNeuralNetwork.m  pass   59/59   [13.653s /  1.265s]
..tistics-1.7.0/Classification/ClassificationPartitionedModel.m  pass   19/19   [94.502s / 31.963s]
../packages/statistics-1.7.0/Classification/ClassificationSVM.m  pass  114/114  [ 1.240s /  1.166s]
..tics-1.7.0/Classification/CompactClassificationDiscriminant.m  pass   28/28   [ 0.232s /  0.229s]
..es/statistics-1.7.0/Classification/CompactClassificationGAM.m  pass   10/10   [11.074s / 11.024s]
..ics-1.7.0/Classification/CompactClassificationNeuralNetwork.m  pass    6/6    [ 1.895s /  0.173s]
..es/statistics-1.7.0/Classification/CompactClassificationSVM.m  pass   29/29   [ 0.163s /  0.082s]

The ClassificationPartitionedModel.m classdef tests all supported classifiers, hence the long duration. Perhaps the Debian CI imposes a shorter waiting time, or perhaps those runners use lower-spec hardware and the timeout limit is reached.

Testing the entire statistics package on my system takes some time:

Fixed test scripts:

                                                        total time (CPU / CLOCK)  [ 391.0s /  241.7s]
Failure Summary:

  ..share/octave/api-v59/packages/statistics-1.7.0/shadow9/mean.m  pass   79/80
                                                    (reported bug) XFAIL   1
  ../share/octave/api-v59/packages/statistics-1.7.0/fillmissing.m  pass  379/380
                                                    (reported bug) XFAIL   1
  ..s/.local/share/octave/api-v59/packages/statistics-1.7.0/pca.m  pass   31/32
                                                (expected failure) XFAIL   1

Summary:

  PASS                            11174
  FAIL                                0
  XFAIL (reported bug)                2
  XFAIL (expected failure)            1

@svillemot
Member Author

Thanks for your feedback.

I am being told that the Debian machine running these tests is very powerful, with 256 GB RAM and 64 cores.

So I don’t see any option other than disabling all the problematic tests, since I don’t understand the issue at hand.

Is there an easy way to disable all tests related to neural networks? (Only those cause a problem.)

@pr0m1th3as
Member

I don't know of any way other than removing the tests, but I'd rather not do that. Furthermore, is there a chance that a regression in some newer library is causing this? My system, although Debian-based, is not the latest there is. Is there any way you could increase the waiting time in the CI before it times out?

@svillemot
Member Author

Unfortunately I cannot control the timeout in the CI, this is decided by another team in Debian.

On my local machine, which is a fairly recent x86-64 desktop, the whole test suite takes about 90 s using an up-to-date Debian unstable (the same as in the CI). So it does not seem that third-party libraries are the source of the problem.

Also note that the CI runners use Netlib BLAS/LAPACK. But locally I get mostly the same duration with both Netlib BLAS/LAPACK and OpenBLAS, so I doubt that forcing OpenBLAS in the CI would solve the issue (though I may try).

@svillemot
Member Author

I ended up disabling the problematic tests because I still don’t understand the underlying issue. Here is the patch that disables the tests, if that may help:
https://salsa.debian.org/pkg-octave-team/octave-statistics/-/blob/debian/latest/debian/patches/tests-disabled-for-debCI.patch?ref_type=heads

@dasergatskov

I also see long times on Ryzen (comparable to what @pr0m1th3as reports), while it is quite fast on Apple:

 ..ave/api-v59/packages/statistics-1.7.0/@cvpartition/training.m  pass    9/9    [ 0.006s /  0.006s]
  ../statistics-1.7.0/Classification/ClassificationDiscriminant.m  pass   65/65   [ 0.482s /  0.487s]
  ../packages/statistics-1.7.0/Classification/ClassificationGAM.m  pass   34/34   [ 8.153s /  8.159s]
  ../packages/statistics-1.7.0/Classification/ClassificationKNN.m  pass  162/162  [ 1.006s /  1.009s]
  ..statistics-1.7.0/Classification/ClassificationNeuralNetwork.m  pass   59/59   [ 0.144s /  0.144s]
  ..tistics-1.7.0/Classification/ClassificationPartitionedModel.m  pass   20/20   [ 6.925s /  6.925s]
  ../packages/statistics-1.7.0/Classification/ClassificationSVM.m  pass  114/114  [ 0.317s /  0.318s]
  ..tics-1.7.0/Classification/CompactClassificationDiscriminant.m  pass   28/28   [ 0.061s /  0.061s]
  ..es/statistics-1.7.0/Classification/CompactClassificationGAM.m  pass   10/10   [ 2.771s /  2.771s]

Perhaps we should investigate it a little more.

@dasergatskov

dasergatskov commented Feb 1, 2025

A quick profile shows that fcnntrain makes the biggest difference.
On a Mac (M4):

octave:15> profile on
octave:16> test ClassificationPartitionedModel
PASSES 20 out of 20 tests
octave:17> profile off
octave:18> profshow
   #                  Function Attr     Time (s)   Time (%)        Calls
------------------------------------------------------------------------
  88                    repmat             1.287      19.12        60646
 176  __splinefit__>splinebase             0.764      11.34         7504
 178                     ppval             0.747      11.09        15014
 180                  shiftdim             0.746      11.08       120112
 171   __splinefit__>arguments             0.399       5.93         7504
 169             __splinefit__             0.354       5.26         7504
 203                 fcnntrain             0.317       4.71           22
 174                     histc             0.298       4.42        14204
 177                      mkpp             0.204       3.03        15008
 166                 splinefit             0.183       2.72         7504
 128                accumarray             0.158       2.35        11424
 179                    unmkpp             0.085       1.27        15014
 164 @ClassificationGAM/fitGAM             0.085       1.26           17
  92                   reshape             0.066       0.98       343050

On Centos Stream 9 / Ryzen 9 3950X:

octave:16> profile on
octave:17> test ClassificationPartitionedModel
PASSES 20 out of 20 tests
octave:18> profile off
octave:19> profshow
   #                  Function Attr     Time (s)   Time (%)        Calls
------------------------------------------------------------------------
 203                 fcnntrain             9.349      30.34           22
  88                    repmat             3.907      12.68        60646
 176  __splinefit__>splinebase             2.723       8.84         7504
 178                     ppval             2.382       7.73        15014
 180                  shiftdim             2.318       7.52       120112
 171   __splinefit__>arguments             1.186       3.85         7504
 169             __splinefit__             1.115       3.62         7504
 174                     histc             0.921       2.99        14204
 177                      mkpp             0.652       2.11        15008
 166                 splinefit             0.537       1.74         7504
 128                accumarray             0.455       1.48        11424
  96                  binary /             0.392       1.27        78060
  92                   reshape             0.278       0.90       343050
 164 @ClassificationGAM/fitGAM             0.266       0.86           17
 179                    unmkpp             0.261       0.85        15014

@pr0m1th3as
Member

pr0m1th3as commented Feb 1, 2025

The only thing I can assume is that compiling on Mac makes much better use of the #pragma omp parallel statements inside the fcnn.cpp code. Of course, I might be wrong about it.

@dasergatskov

The only thing I can assume is that compiling on Mac makes much better use of the #pragma omp parallel statements inside the fcnn.cpp code. Of course, I might be wrong about it.

That was a good lead. I think the problem is that #pragma omp parallel does not do anything useful in this particular case: it just spawns nproc threads. On Apple (with its clang compiler) additional flags are required to enable OpenMP, so it is not used (everything runs single-threaded). On Ryzen (gcc, 16 cores / 32 threads) I see:

$ OMP_NUM_THREADS=1 octave -fq
octave:1> pkg load statistics
octave:2> t1=tic; test ClassificationPartitionedModel; toc(t1)
PASSES 20 out of 20 tests
Elapsed time is 20.8803 seconds.
octave:3> 
$ OMP_NUM_THREADS=2 octave -fq
octave:1> pkg load statistics
octave:2> t1=tic; test ClassificationPartitionedModel; toc(t1)
PASSES 20 out of 20 tests
Elapsed time is 23.2406 seconds.
$ OMP_NUM_THREADS=16 octave -fq
octave:1> pkg load statistics
octave:2> t1=tic; test ClassificationPartitionedModel; toc(t1)
PASSES 20 out of 20 tests
Elapsed time is 25.2855 seconds.
$ OMP_NUM_THREADS=32 octave -fq
octave:1> pkg load statistics
octave:2> t1=tic; test ClassificationPartitionedModel; toc(t1)
PASSES 20 out of 20 tests
Elapsed time is 28.6746 seconds.

That may explain @svillemot's extra-long times with 64 cores (128 threads?).
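
For what it's worth, here is a quick standalone probe (just a sketch, not part of fcnn.cpp) to check whether a given build actually has OpenMP enabled and how many threads a parallel region would use; compiled without -fopenmp, as with Apple's default clang, it reports the single-threaded fallback:

 #include <cstdio>
 #ifdef _OPENMP
 #  include <omp.h>
 #endif

 int main ()
 {
 #ifdef _OPENMP
   // With -fopenmp this reports the default team size (typically nproc,
   // or whatever OMP_NUM_THREADS is set to).
   std::printf ("OpenMP enabled, max threads = %d\n", omp_get_max_threads ());
 #else
   // Without -fopenmp the pragmas are ignored and everything runs serially.
   std::printf ("OpenMP not compiled in; running single-threaded\n");
 #endif
   return 0;
 }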

@pr0m1th3as
Member

This makes sense. But we do need the #pragma omp parallel support, because training under normal circumstances involves larger data and networks. When I initially put together the C++ code for the fcnn classdef, it was without the parallel directives, and on moderately sized examples from my work it took a very long time to train.

Is there an efficient way to switch parallel processing on or off inside the C++ code depending on the amount of data? Would this improve the overall performance?

Perhaps @svillemot can run the tests with $ OMP_NUM_THREADS=1 octave -fq instead of disabling them with the patch.

@dasergatskov

dasergatskov commented Feb 2, 2025

Is there an efficient way to switch parallel processing on or off inside the C++ code depending on the amount of data? Would this improve the overall performance?

I am not an expert, but see
https://www.openmp.org/spec-html/5.0/openmpsu110.html

I do not know whether it is possible to change the maximum number of OMP threads from within Octave. One needs to set OMP_NUM_THREADS before starting the Octave session.

As a side note: by default OpenMP sets OMP_NUM_THREADS to nproc, which on a CPU with simultaneous multithreading is twice the number of physical cores. Quite often this is counterproductive, so for your actual use case you may want to set OMP_NUM_THREADS to the number of physical cores on your computer, or even fewer.

@pr0m1th3as
Member

pr0m1th3as commented Feb 2, 2025

I do not know whether it is possible to change the maximum number of OMP threads from within Octave. One needs to set OMP_NUM_THREADS before starting the Octave session.

From what I've found online, I could use

 #include <omp.h>

 int n_threads = 4;
 omp_set_num_threads(n_threads);   // cap the team size for later parallel regions
 #pragma omp parallel
 {
   /* parallel work here */
 }

to limit the number of threads from inside the program at run time. However, I am not sure how to determine the most appropriate number of threads based on the layer sizes of the fcnn. I can tell from the code that the amount of data does not matter; the parallelization is applied per layer, so only the complexity of each layer counts. Is there a rule of thumb for this, or do I have to test by trial and error until I find the point at which efficiency is maximized?
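
What I have in mind is something along these lines (a hypothetical sketch, not the actual fcnn.cpp code; the threshold and the loop body are placeholders):

 // Hypothetical threshold, to be tuned; not a value taken from fcnn.cpp.
 const int SIZE_THRESHOLD = 1000;

 void forward_layer (const double *in, double *out, int n_neurons, int n_inputs)
 {
   // The if clause makes the loop run serially when the layer is small,
   // so no thread team is spawned for work that cannot amortize the overhead.
   #pragma omp parallel for if (n_neurons > SIZE_THRESHOLD)
   for (int i = 0; i < n_neurons; i++)
     {
       double acc = 0.0;
       for (int j = 0; j < n_inputs; j++)
         acc += in[j];            // placeholder for weight(i,j) * input(j)
       out[i] = acc;
     }
 }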

@dasergatskov

I seem to remember (but do not quote me on that) that OpenBLAS (or maybe it was MKL) sets the number of OMP threads to 1 for matrices with fewer than 1000 elements. I do not think there is a universal rule of thumb.

@dasergatskov

dasergatskov commented Feb 2, 2025

Maybe you could just make n_threads a parameter to your function and let the user set it as needed?

@pr0m1th3as
Member

I think I will go for a combination of both.

@dasergatskov

I think you should also experiment with the chunk size and with static vs. dynamic scheduling.

pr0m1th3as added a commit that referenced this issue Feb 2, 2025
…umber of threads for omp as well as 'alpha' parameter for ReLU and ELU activation layers, see issue #168
@pr0m1th3as
Member

I made some changes to the compiled functions to accept the number of threads as an input parameter and to default to 1 thread when computing layers with fewer than 1000 neurons. The profiling results below, on my machine (Intel® Core™ i7-10710U CPU @ 1.10GHz × 12 with Ubuntu 20.04 LTS), show percentages similar to those on the Mac.

>> pkg load statistics
>> profile on
>> test ClassificationPartitionedModel
PASSES 20 out of 20 tests
>> profile off
>> profshow
   #                  Function Attr     Time (s)   Time (%)        Calls
------------------------------------------------------------------------
  88                    repmat             5.115      16.85        60646
 178                     ppval             3.568      11.75        15014
 176  __splinefit__>splinebase             3.300      10.87         7504
 180                  shiftdim             3.078      10.14       120112
 169             __splinefit__             1.713       5.64         7504
 171   __splinefit__>arguments             1.636       5.39         7504
 204                 fcnntrain             1.394       4.59           22
 174                     histc             1.277       4.21        14204
 177                      mkpp             0.892       2.94        15008
 166                 splinefit             0.825       2.72         7504
 128                accumarray             0.592       1.95        11424
 164 @ClassificationGAM/fitGAM             0.407       1.34           17
  92                   reshape             0.389       1.28       343050
 179                    unmkpp             0.350       1.15        15014
  89                      size             0.292       0.96       288490
  85                   permute             0.280       0.92       120700
  96                  binary /             0.232       0.76        78060
   2                    nargin             0.189       0.62       536884
  42                    evalin             0.188       0.62         7956
  90                       all             0.167       0.55       168884

The relevant tests run faster with the latest changes:

>>  t1=tic; test ClassificationPartitionedModel; toc(t1)
PASSES 20 out of 20 tests
Elapsed time is 26.8613 seconds.
>> t1=tic; test ClassificationNeuralNetwork; toc(t1)
PASSES 59 out of 59 tests
Elapsed time is 0.539263 seconds.

@svillemot Can you test the latest sources to see whether they still cause a timeout on the CI?

@dasergatskov I haven't changed the classdef to accept an additional optional NumThreads argument; I just use nproc inside the constructor and the predict and resubPredict methods of ClassificationNeuralNetwork to determine the maximum number of logical processors and use that. The logic for defaulting NumThreads to 1 is inside the fcnntrain and fcnnpredict functions. At the moment, I don't have any datasets large enough to train/test with different layer sizes in order to estimate a better threshold than the 1000 you suggested.
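
In rough terms, that thread-count logic is along these lines (a simplified sketch for illustration, not the actual fcnntrain/fcnnpredict code; the helper name is made up):

 #include <algorithm>
 #include <omp.h>

 // Serial for small layers, otherwise the requested thread count capped at
 // the number of available processors.
 static int effective_threads (int requested, int layer_size)
 {
   if (layer_size < 1000)        // small layers: threading overhead dominates
     return 1;
   return std::max (1, std::min (requested, omp_get_num_procs ()));
 }

 // ...then, before the parallel region that processes the layer:
 //   omp_set_num_threads (effective_threads (num_threads, n_neurons));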

@svillemot
Member Author

I confirm that the CI timeout issue is now gone with your latest fixes. Thanks.
