datasize < batch size? #178

Open
OscarDPan opened this issue Feb 3, 2021 · 5 comments

Comments

@OscarDPan

Hi @maxpumperla, do you mind giving an explanation of why you placed this if-statement in the first place? (cc @danielenricocahall)

https://github.com/maxpumperla/elephas/blob/master/elephas/worker.py#L107
https://github.com/maxpumperla/elephas/blob/master/elephas/worker.py#L116

Does it matter if my batch size is bigger than the available training data?

@maxpumperla
Owner

@OscarDPan the idea is that you might not want to train if you don't have enough data to fill a single batch in the first place. We're using train_config, which has a pre-defined batch_size. This may be clearer in the second case above.

I know that this led to errors at some point, but maybe this check isn't needed anymore. Would you be willing to write a test with a dataset that violates the condition and check whether training still runs without the if-statement?
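For context, the guard reads roughly like the following (a paraphrase of the linked worker.py lines, not a verbatim copy; the function and variable names here are assumptions):

```python
def train_on_partition(model, x_train, y_train, train_config):
    """Rough paraphrase of the guarded training call in elephas/worker.py."""
    batch_size = train_config.get('batch_size', 32)
    if x_train.shape[0] > batch_size:
        # Only partitions holding more samples than one batch are trained on.
        model.fit(x_train, y_train, **train_config)
    # Otherwise the partition is skipped silently and contributes no weight
    # update, which is why a unit test with a huge batch size appears to do nothing.
    return model
```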

@OscarDPan
Author

Will do. Thanks.

@OscarDPan
Author

So I was contemplating whether I should remove the check, given that there may be some historical reason for it ("this led to errors at some point"), and it does make sense that the batch size should not be greater than the number of samples.

It took me a while to notice this if-statement: I was running a unit test with a huge batch size and the model wasn't being updated at all.

I wanted to ask you which approach would be best:

  1. Remove the if-statement in both "synchronous" mode and "async"/"epoch" mode, and add tests as mentioned above.
  2. Keep the if-statement but add a very obvious print() to warn end users.
  3. Do number 2, but instead of print(), create a LoggerMixin object to do the job (see the sketch after this list). This would be the most proper way, but it requires a lot more code changes.
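A rough sketch of what option 3 could look like (the LoggerMixin name is from the proposal above; the Worker class below is only a stand-in for the real elephas worker, not existing code):

```python
import logging


class LoggerMixin:
    """Hypothetical mixin: gives any class a lazily-created, per-class logger."""

    @property
    def logger(self):
        if not hasattr(self, "_logger"):
            # One named logger per concrete class, e.g. "elephas.Worker"
            self._logger = logging.getLogger(f"elephas.{type(self).__name__}")
        return self._logger


class Worker(LoggerMixin):
    """Stand-in for the real elephas worker class."""

    def train(self, x_train, batch_size):
        if x_train.shape[0] <= batch_size:
            self.logger.warning(
                "Partition has %d samples but batch_size is %d; "
                "skipping training on this partition.",
                x_train.shape[0], batch_size,
            )
            return
        # ... fit the model as before ...
```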

@danielenricocahall
Collaborator

I'm in favor of 3 for sure - it would be great to use a logger in several places in the code instead of print.

@maxpumperla
Owner

@OscarDPan if there's no need for the if-clause, go for 1.; otherwise please go for 3. print() makes no sense in distributed systems most of the time; you want to log properly to the Spark driver.

By the way, just to get this conceptually clear: the problem is not "huge batch sizes", but rather the small, left-over batches at the end of your training set. I.e. think of a dataset of length 10005 and a batch size of 100. This gives you a batch of 5 at the end of each epoch. Given that we distribute the data, let's say across 10 nodes, there will 100% be nodes that don't get any data for training. I haven't tested whether we can handle this scenario properly, but it was an issue at some point (maybe gone with later Keras versions). tl;dr: check whether we get errors when training with "no data".
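To put rough numbers on this (plain Python, not elephas code; the even split below is an idealization of how the data ends up distributed across workers):

```python
def partition_sizes(n_samples, n_partitions):
    """Sizes of roughly equal partitions, the way data is split across workers."""
    base, remainder = divmod(n_samples, n_partitions)
    return [base + (1 if i < remainder else 0) for i in range(n_partitions)]

batch_size = 100

# 10005 samples over 10 workers: every partition comfortably exceeds one batch,
# but each still produces a small left-over batch at the end of every epoch.
print(partition_sizes(10005, 10))   # [1001, 1001, 1001, 1001, 1001, 1000, ...]

# A small dataset spread over many workers is where the guard bites hardest:
# every partition falls below batch_size, so with the if-statement in place
# no partition trains at all.
sizes = partition_sizes(500, 10)
print(sizes)                               # [50, 50, ..., 50]
print(any(s > batch_size for s in sizes))  # False -> every partition is skipped
```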
