datasize < batch size? #178

Open
OscarDPan opened this issue Feb 3, 2021 · 5 comments

Comments

@OscarDPan

Hi @maxpumperla, do you mind giving an explanation of why you placed this if-statement in the first place? (cc @danielenricocahall)

https://github.com/maxpumperla/elephas/blob/master/elephas/worker.py#L107
https://github.com/maxpumperla/elephas/blob/master/elephas/worker.py#L116

Does it matter if my batch size is bigger than the available training data?

@maxpumperla
Owner

@OscarDPan the idea is that you might not want to train if you don't have enough data to fill a single batch in the first place. We're using train_config, which has a pre-defined batch_size. This may be clearer in the second case above.

I know that this led to errors at some point, but maybe this check isn't needed anymore. Would you be willing to write a test with a dataset that violates the condition and check whether training still runs without the if-statement?
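For context, the guard reads roughly like the following (a paraphrase of the linked worker.py lines, not a verbatim copy; the function and variable names here are assumptions):

```python
def train_on_partition(model, x_train, y_train, train_config):
    """Rough paraphrase of the guarded training call in elephas/worker.py."""
    batch_size = train_config.get('batch_size', 32)
    if x_train.shape[0] > batch_size:
        # Only partitions holding more samples than one batch are trained on.
        model.fit(x_train, y_train, **train_config)
    # Otherwise the partition is skipped silently and contributes no weight
    # update, which is why a unit test with a huge batch size appears to do nothing.
    return model
```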

@OscarDPan
Author

Will do. Thanks.

@OscarDPan
Author

So I was contemplating whether I should remove the check, given that there may be some historical reason for it ("this led to errors at some point"), and it does make sense that the batch size should not be greater than the number of samples.

It took me a while to notice this if-statement: I was running a unit test with a huge batch size and the model wasn't being updated at all.

I wanted to ask you which approach would be best:

  1. Remove the if-statement in both "synchronous" mode and "async"/"epoch" mode, and add tests as mentioned above.
  2. Keep the if-statement but add a very obvious print() to warn end users.
  3. Do number 2, but instead of print(), create a LoggerMixin object to do the job (see the sketch after this list). This would be the most proper way, but it requires a lot more code changes.
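A rough sketch of what option 3 could look like (the LoggerMixin name is from the proposal above; the Worker class below is only a stand-in for the real elephas worker, not existing code):

```python
import logging


class LoggerMixin:
    """Hypothetical mixin: gives any class a lazily-created, per-class logger."""

    @property
    def logger(self):
        if not hasattr(self, "_logger"):
            # One named logger per concrete class, e.g. "elephas.Worker"
            self._logger = logging.getLogger(f"elephas.{type(self).__name__}")
        return self._logger


class Worker(LoggerMixin):
    """Stand-in for the real elephas worker class."""

    def train(self, x_train, batch_size):
        if x_train.shape[0] <= batch_size:
            self.logger.warning(
                "Partition has %d samples but batch_size is %d; "
                "skipping training on this partition.",
                x_train.shape[0], batch_size,
            )
            return
        # ... fit the model as before ...
```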

@danielenricocahall
Collaborator

I'm in favor of 3 for sure - it would be great to use a logger in several places in the code instead of print.

@maxpumperla
Owner

@OscarDPan if there's no need for the if-clause, go for 1.; otherwise please go for 3. print() makes no sense in distributed systems most of the time; you want to log properly to the Spark driver.

By the way, just to get this conceptually clear: the problem is not "huge batch sizes", but rather the small, left-over batches at the end of your training set. I.e. think of a dataset of length 10005 and a batch size of 100. This gives you a batch of 5 at the end of each epoch. Given that we distribute the data, let's say across 10 nodes, there will 100% be nodes that don't get any data for training. I haven't tested whether we can handle this scenario properly, but it was an issue at some point (maybe gone with later Keras versions). tl;dr: check whether we get errors when training with "no data".
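To put rough numbers on this (plain Python, not elephas code; the even split below is an idealization of how the data ends up distributed across workers):

```python
def partition_sizes(n_samples, n_partitions):
    """Sizes of roughly equal partitions, the way data is split across workers."""
    base, remainder = divmod(n_samples, n_partitions)
    return [base + (1 if i < remainder else 0) for i in range(n_partitions)]

batch_size = 100

# 10005 samples over 10 workers: every partition comfortably exceeds one batch,
# but each still produces a small left-over batch at the end of every epoch.
print(partition_sizes(10005, 10))   # [1001, 1001, 1001, 1001, 1001, 1000, ...]

# A small dataset spread over many workers is where the guard bites hardest:
# every partition falls below batch_size, so with the if-statement in place
# no partition trains at all.
sizes = partition_sizes(500, 10)
print(sizes)                               # [50, 50, ..., 50]
print(any(s > batch_size for s in sizes))  # False -> every partition is skipped
```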
