Tests related to ClassificationNeuralNetwork very slow on Debian x86-64 CI runners #168

Closed
svillemot opened this issue Jan 13, 2025 · 18 comments

Comments

@svillemot
Member

The tests related to ClassificationNeuralNetwork are extremely slow compared to the others on Debian x86-64 CI runners.

See for example those logs which have timings prepended to each log line:
https://ci.debian.net/packages/o/octave-statistics/unstable/amd64/56397827/
https://ci.debian.net/packages/o/octave-statistics/unstable/amd64/55938923/

This leads to timeouts on the CI runners.

Curiously, the problem does not manifest on other processor architectures (not even on x86-32).

Do you have any idea of what might be causing the problem?

@pr0m1th3as
Member

pr0m1th3as commented Jan 13, 2025

The classification classdefs each contain several tests, and some of them take a reasonable amount of time because, apart from the usual input validation and error-checking BISTs, they include a number of tests that actually train models.
On my machine (Intel® Core™ i7-10710U CPU @ 1.10GHz × 12 with Ubuntu 20.04 LTS) they take the following time:

../statistics-1.7.0/Classification/ClassificationDiscriminant.m  pass   65/65   [ 1.699s/  1.669s]
../packages/statistics-1.7.0/Classification/ClassificationGAM.m  pass   34/34   [32.343s / 31.469s]
../packages/statistics-1.7.0/Classification/ClassificationKNN.m  pass  162/162  [ 3.726s /  3.846s]
..statistics-1.7.0/Classification/ClassificationNeuralNetwork.m  pass   59/59   [13.653s /  1.265s]
..tistics-1.7.0/Classification/ClassificationPartitionedModel.m  pass   19/19   [94.502s / 31.963s]
../packages/statistics-1.7.0/Classification/ClassificationSVM.m  pass  114/114  [ 1.240s /  1.166s]
..tics-1.7.0/Classification/CompactClassificationDiscriminant.m  pass   28/28   [ 0.232s /  0.229s]
..es/statistics-1.7.0/Classification/CompactClassificationGAM.m  pass   10/10   [11.074s / 11.024s]
..ics-1.7.0/Classification/CompactClassificationNeuralNetwork.m  pass    6/6    [ 1.895s /  0.173s]
..es/statistics-1.7.0/Classification/CompactClassificationSVM.m  pass   29/29   [ 0.163s /  0.082s]

The ClassificationPartitionedModel.m classdef tests all supported classifiers, hence the long duration. Perhaps the Debian CI imposes a shorter waiting time, or perhaps those runners use lower-spec hardware and the timeout limit is reached.

Testing the entire statistics package on my system takes some time:

Fixed test scripts:

                                                        total time (CPU / CLOCK)  [ 391.0s /  241.7s]
Failure Summary:

  ..share/octave/api-v59/packages/statistics-1.7.0/shadow9/mean.m  pass   79/80
                                                    (reported bug) XFAIL   1
  ../share/octave/api-v59/packages/statistics-1.7.0/fillmissing.m  pass  379/380
                                                    (reported bug) XFAIL   1
  ..s/.local/share/octave/api-v59/packages/statistics-1.7.0/pca.m  pass   31/32
                                                (expected failure) XFAIL   1

Summary:

  PASS                            11174
  FAIL                                0
  XFAIL (reported bug)                2
  XFAIL (expected failure)            1

@svillemot
Member Author

Thanks for your feedback.

I am being told that the Debian machine running these tests is very powerful, with 256 GB RAM and 64 cores.

So I don’t see any option other than disabling all the problematic tests, since I don’t understand the issue at hand.

Is there an easy way to disable all tests related to neural networks? (Only those cause a problem.)

@pr0m1th3as
Member

I don't know of any way other than removing the tests, but I'd rather not do that. Furthermore, is there a chance that a regression in some newer library is causing this? My system, although Debian-based, is not the latest there is. Is there any way you could increase the waiting time in the CI before it times out?

@svillemot
Member Author

Unfortunately I cannot control the timeout in the CI, this is decided by another team in Debian.

On my local machine, which is a fairly recent x86-64 desktop, the whole test suite takes about 90 s using an up-to-date Debian unstable (the same as in the CI). So it does not seem that third-party libraries are the source of the problem.

Also note that the CI runners use Netlib BLAS/LAPACK. But locally I get mostly the same duration with both Netlib BLAS/LAPACK and OpenBLAS, so I doubt that forcing OpenBLAS in the CI would solve the issue (though I may try).

@svillemot
Member Author

I ended up disabling the problematic tests because I still don’t understand the underlying issue. Here is the patch that disables the tests, if that may help:
https://salsa.debian.org/pkg-octave-team/octave-statistics/-/blob/debian/latest/debian/patches/tests-disabled-for-debCI.patch?ref_type=heads

@dasergatskov

I also see long times on Ryzen (comparable to what @pr0m1th3as reports), while it is quite fast on Apple:

 ..ave/api-v59/packages/statistics-1.7.0/@cvpartition/training.m  pass    9/9    [ 0.006s /  0.006s]
  ../statistics-1.7.0/Classification/ClassificationDiscriminant.m  pass   65/65   [ 0.482s /  0.487s]
  ../packages/statistics-1.7.0/Classification/ClassificationGAM.m  pass   34/34   [ 8.153s /  8.159s]
  ../packages/statistics-1.7.0/Classification/ClassificationKNN.m  pass  162/162  [ 1.006s /  1.009s]
  ..statistics-1.7.0/Classification/ClassificationNeuralNetwork.m  pass   59/59   [ 0.144s /  0.144s]
  ..tistics-1.7.0/Classification/ClassificationPartitionedModel.m  pass   20/20   [ 6.925s /  6.925s]
  ../packages/statistics-1.7.0/Classification/ClassificationSVM.m  pass  114/114  [ 0.317s /  0.318s]
  ..tics-1.7.0/Classification/CompactClassificationDiscriminant.m  pass   28/28   [ 0.061s /  0.061s]
  ..es/statistics-1.7.0/Classification/CompactClassificationGAM.m  pass   10/10   [ 2.771s /  2.771s]

Perhaps we should investigate it a little more.

@dasergatskov

dasergatskov commented Feb 1, 2025

A quick profile shows that fcnntrain makes the biggest difference.
On a Mac (M4):

octave:15> profile on
octave:16> test ClassificationPartitionedModel
PASSES 20 out of 20 tests
octave:17> profile off
octave:18> profshow
   #                  Function Attr     Time (s)   Time (%)        Calls
------------------------------------------------------------------------
  88                    repmat             1.287      19.12        60646
 176  __splinefit__>splinebase             0.764      11.34         7504
 178                     ppval             0.747      11.09        15014
 180                  shiftdim             0.746      11.08       120112
 171   __splinefit__>arguments             0.399       5.93         7504
 169             __splinefit__             0.354       5.26         7504
 203                 fcnntrain             0.317       4.71           22
 174                     histc             0.298       4.42        14204
 177                      mkpp             0.204       3.03        15008
 166                 splinefit             0.183       2.72         7504
 128                accumarray             0.158       2.35        11424
 179                    unmkpp             0.085       1.27        15014
 164 @ClassificationGAM/fitGAM             0.085       1.26           17
  92                   reshape             0.066       0.98       343050

On Centos Stream 9 / Ryzen 9 3950X:

octave:16> profile on
octave:17> test ClassificationPartitionedModel
PASSES 20 out of 20 tests
octave:18> profile off
octave:19> profshow
   #                  Function Attr     Time (s)   Time (%)        Calls
------------------------------------------------------------------------
 203                 fcnntrain             9.349      30.34           22
  88                    repmat             3.907      12.68        60646
 176  __splinefit__>splinebase             2.723       8.84         7504
 178                     ppval             2.382       7.73        15014
 180                  shiftdim             2.318       7.52       120112
 171   __splinefit__>arguments             1.186       3.85         7504
 169             __splinefit__             1.115       3.62         7504
 174                     histc             0.921       2.99        14204
 177                      mkpp             0.652       2.11        15008
 166                 splinefit             0.537       1.74         7504
 128                accumarray             0.455       1.48        11424
  96                  binary /             0.392       1.27        78060
  92                   reshape             0.278       0.90       343050
 164 @ClassificationGAM/fitGAM             0.266       0.86           17
 179                    unmkpp             0.261       0.85        15014

@pr0m1th3as
Member

pr0m1th3as commented Feb 1, 2025

The only thing I can assume is that compiling on Mac makes much better use of the #pragma omp parallel statements inside the fcnn.cpp code. Of course, I might be wrong about it.

@dasergatskov

The only thing I can assume is that compiling on Mac makes much better use of the #pragma omp parallel statements inside the fcnn.cpp code. Of course, I might be wrong about it.

That was a good lead. I think the problem is that #pragma omp parallel does not do anything useful in this particular case: it just spawns nproc threads. On Apple (with its clang compiler) additional flags are required to enable OpenMP, so it is not used (everything runs single-threaded). On Ryzen (gcc, 16 cores / 32 threads) I see:

$ OMP_NUM_THREADS=1 octave -fq
octave:1> pkg load statistics
octave:2> t1=tic; test ClassificationPartitionedModel; toc(t1)
PASSES 20 out of 20 tests
Elapsed time is 20.8803 seconds.
octave:3> 
$ OMP_NUM_THREADS=2 octave -fq
octave:1> pkg load statistics
octave:2> t1=tic; test ClassificationPartitionedModel; toc(t1)
PASSES 20 out of 20 tests
Elapsed time is 23.2406 seconds.
$ OMP_NUM_THREADS=16 octave -fq
octave:1> pkg load statistics
octave:2> t1=tic; test ClassificationPartitionedModel; toc(t1)
PASSES 20 out of 20 tests
Elapsed time is 25.2855 seconds.
$ OMP_NUM_THREADS=32 octave -fq
octave:1> pkg load statistics
octave:2> t1=tic; test ClassificationPartitionedModel; toc(t1)
PASSES 20 out of 20 tests
Elapsed time is 28.6746 seconds.

That may explain @svillemot's extra-long times with 64 cores (128 threads?).
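
For what it's worth, here is a quick standalone probe (just a sketch, not part of fcnn.cpp) to check whether a given build actually has OpenMP enabled and how many threads a parallel region would use; compiled without -fopenmp, as with Apple's default clang, it reports the single-threaded fallback:

 #include <cstdio>
 #ifdef _OPENMP
 #  include <omp.h>
 #endif

 int main ()
 {
 #ifdef _OPENMP
   // With -fopenmp this reports the default team size (typically nproc,
   // or whatever OMP_NUM_THREADS is set to).
   std::printf ("OpenMP enabled, max threads = %d\n", omp_get_max_threads ());
 #else
   // Without -fopenmp the pragmas are ignored and everything runs serially.
   std::printf ("OpenMP not compiled in; running single-threaded\n");
 #endif
   return 0;
 }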

@pr0m1th3as
Member

This makes sense. But we do need the #pragma omp parallel support, because training under normal circumstances involves larger data and networks. When I initially put together the C++ code for the fcnn classdef, it was without the parallel directives, and on moderately sized examples from my work it took a very long time to train.

Is there an efficient way to switch parallel processing on or off inside the C++ code depending on the amount of data? Would this improve the overall performance?

Perhaps @svillemot can run the tests with $ OMP_NUM_THREADS=1 octave -fq instead of disabling them with the patch.

@dasergatskov

dasergatskov commented Feb 2, 2025

Is there an efficient way to switch parallel processing on or off inside the C++ code depending on the amount of data? Would this improve the overall performance?

I am not an expert, but see
https://www.openmp.org/spec-html/5.0/openmpsu110.html

I do not know whether it is possible to change the maximum number of OMP threads from within Octave. One needs to set OMP_NUM_THREADS before starting the Octave session.

As a side note: by default OpenMP sets OMP_NUM_THREADS to nproc, which on a CPU with simultaneous multithreading is twice the number of physical cores. Quite often this is counterproductive, so for your actual use case you may want to set OMP_NUM_THREADS to the number of physical cores on your computer, or even fewer.

@pr0m1th3as
Member

pr0m1th3as commented Feb 2, 2025

I do not know whether it is possible to change the maximum number of OMP threads from within Octave. One needs to set OMP_NUM_THREADS before starting the Octave session.

From what I've found online, I could use

 #include <omp.h>

 int n_threads = 4;
 omp_set_num_threads(n_threads);   // cap the team size for later parallel regions
 #pragma omp parallel
 {
   /* parallel work here */
 }

to limit the number of threads from inside the program at run time. However, I am not sure how to determine the most appropriate number of threads based on the layer sizes of the fcnn. I can tell from the code that the amount of data does not matter; the parallelization is applied per layer, so only the complexity of each layer counts. Is there a rule of thumb for this, or do I have to test by trial and error until I find the point at which efficiency is maximized?
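
What I have in mind is something along these lines (a hypothetical sketch, not the actual fcnn.cpp code; the threshold and the loop body are placeholders):

 // Hypothetical threshold, to be tuned; not a value taken from fcnn.cpp.
 const int SIZE_THRESHOLD = 1000;

 void forward_layer (const double *in, double *out, int n_neurons, int n_inputs)
 {
   // The if clause makes the loop run serially when the layer is small,
   // so no thread team is spawned for work that cannot amortize the overhead.
   #pragma omp parallel for if (n_neurons > SIZE_THRESHOLD)
   for (int i = 0; i < n_neurons; i++)
     {
       double acc = 0.0;
       for (int j = 0; j < n_inputs; j++)
         acc += in[j];            // placeholder for weight(i,j) * input(j)
       out[i] = acc;
     }
 }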

@dasergatskov

I seem to remember (but do not quote me on that) that OpenBLAS (or maybe it was MKL) sets the number of OMP threads to 1 for matrices with fewer than 1000 elements. I do not think there is a universal rule of thumb.

@dasergatskov

dasergatskov commented Feb 2, 2025

Maybe you could just make n_threads a parameter to your function and let the user set it as needed?

@pr0m1th3as
Member

I think I will go for a combination of both.

@dasergatskov

I think you should also experiment with the chunk size and with static vs. dynamic scheduling.

pr0m1th3as added a commit that referenced this issue Feb 2, 2025
…umber of threads for omp as well as 'alpha' parameter for ReLU and ELU activation layers, see issue #168
@pr0m1th3as
Member

I made some changes to the compiled functions to accept the number of threads as an input parameter and to default to 1 thread when computing layers with fewer than 1000 neurons. The profiling results below, on my machine (Intel® Core™ i7-10710U CPU @ 1.10GHz × 12 with Ubuntu 20.04 LTS), show percentages similar to those on the Mac.

>> pkg load statistics
>> profile on
>> test ClassificationPartitionedModel
PASSES 20 out of 20 tests
>> profile off
>> profshow
   #                  Function Attr     Time (s)   Time (%)        Calls
------------------------------------------------------------------------
  88                    repmat             5.115      16.85        60646
 178                     ppval             3.568      11.75        15014
 176  __splinefit__>splinebase             3.300      10.87         7504
 180                  shiftdim             3.078      10.14       120112
 169             __splinefit__             1.713       5.64         7504
 171   __splinefit__>arguments             1.636       5.39         7504
 204                 fcnntrain             1.394       4.59           22
 174                     histc             1.277       4.21        14204
 177                      mkpp             0.892       2.94        15008
 166                 splinefit             0.825       2.72         7504
 128                accumarray             0.592       1.95        11424
 164 @ClassificationGAM/fitGAM             0.407       1.34           17
  92                   reshape             0.389       1.28       343050
 179                    unmkpp             0.350       1.15        15014
  89                      size             0.292       0.96       288490
  85                   permute             0.280       0.92       120700
  96                  binary /             0.232       0.76        78060
   2                    nargin             0.189       0.62       536884
  42                    evalin             0.188       0.62         7956
  90                       all             0.167       0.55       168884

The relevant tests run faster with the latest changes:

>>  t1=tic; test ClassificationPartitionedModel; toc(t1)
PASSES 20 out of 20 tests
Elapsed time is 26.8613 seconds.
>> t1=tic; test ClassificationNeuralNetwork; toc(t1)
PASSES 59 out of 59 tests
Elapsed time is 0.539263 seconds.

@svillemot Can you test the latest sources to see whether they still cause a timeout on the CI?

@dasergatskov I haven't changed the classdef to accept an additional optional NumThreads argument; I just use nproc inside the constructor and the predict and resubPredict methods of ClassificationNeuralNetwork to determine the maximum number of logical processors and use that. The logic for defaulting NumThreads to 1 is inside the fcnntrain and fcnnpredict functions. At the moment, I don't have any datasets large enough to train/test with different layer sizes in order to estimate a better threshold than the 1000 you suggested.
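
In rough terms, that thread-count logic is along these lines (a simplified sketch for illustration, not the actual fcnntrain/fcnnpredict code; the helper name is made up):

 #include <algorithm>
 #include <omp.h>

 // Serial for small layers, otherwise the requested thread count capped at
 // the number of available processors.
 static int effective_threads (int requested, int layer_size)
 {
   if (layer_size < 1000)        // small layers: threading overhead dominates
     return 1;
   return std::max (1, std::min (requested, omp_get_num_procs ()));
 }

 // ...then, before the parallel region that processes the layer:
 //   omp_set_num_threads (effective_threads (num_threads, n_neurons));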

@svillemot
Member Author

I confirm that the CI timeout issue is now gone with your latest fixes. Thanks.
