batch size vs step size

Hi, please enlighten me on the following:
Based on the same dataset, a large batch size would mean fewer epoch runs. Since the weight adjustments are made after every epoch run, doesn’t that mean a large batch size will converge more slowly, since fewer adjustments are made to the weights due to fewer epoch runs? If so, what then is the purpose of having a large batch size?
The situation would be made even worse if the learning rate is small with a large batch size, since it would converge even more slowly.

If the above observation is true, what then is the purpose of those super-powerful GPUs with huge amounts of VRAM, if training with a large batch size actually converges more slowly? Could someone please enlighten me?

Batch size

The weight update depends on the optimizer used, but usually, the update happens after a single batch is processed:

```python
for inputs, targets in dataset:             # acquire a batch
    optimizer.zero_grad()                   # set grads to 0
    outputs = net(inputs)                   # get network response
    loss = loss_function(outputs, targets)  # calculate loss for given outputs and targets
    loss.backward()                         # backpropagate the error
    optimizer.step()                        # <--- here the actual weight update happens

print("Epoch done")
```

“Epoch” is a different term here. I think you actually mean training steps: the weights are updated after every batch (every step), not once per epoch, so a larger batch size means fewer updates per epoch, not fewer epochs.
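To make the step/epoch distinction concrete, here is a tiny sketch with made-up numbers (dataset size and batch sizes are mine, not from the thread): one epoch is a full pass over the dataset, and the number of weight updates per epoch shrinks as the batch size grows.

```python
# Toy numbers: one epoch = one full pass, one step = one weight update.
dataset_size = 10_000

for batch_size in (32, 256):
    steps_per_epoch = -(-dataset_size // batch_size)  # ceiling division
    print(f"batch_size={batch_size}: {steps_per_epoch} weight updates per epoch")
```

So with a batch size of 32 you get 313 updates per epoch, but with 256 only 40; either way the whole dataset is seen once per epoch.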

The batch size is a number you have to “feel out” on your own. Sometimes you’ll fiddle with this hyperparameter a lot; sometimes you’ll get it right the first time. The network architecture may also play a crucial role here.

The smaller the batch size, the noisier the weight updates become (because fewer examples participate in “voting” for the best direction to move). Training might not converge at all, because it will circle around the minimum (local or global).
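You can see this noise effect in a toy simulation (my illustration, not from the thread): treat each example’s gradient as a noisy sample around the “true” direction, and look at how spread out the batch-averaged update is for different batch sizes.

```python
import numpy as np

# Toy per-example "gradients": noisy samples around a true direction of 1.0.
rng = np.random.default_rng(0)
per_example_grads = rng.normal(loc=1.0, scale=5.0, size=102_400)

for batch_size in (8, 512):
    # Average per-example gradients within each batch to get the update direction.
    batch_means = per_example_grads.reshape(-1, batch_size).mean(axis=1)
    print(f"batch_size={batch_size}: update std = {batch_means.std():.2f}")
```

The spread of the averaged update shrinks roughly like 1/sqrt(batch_size), which is exactly the small-batch “circling around the minimum” noise described above.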

As you correctly point out, the larger the batch size, the slower the convergence can be. But that’s probably because the weight updates become close to 0: 100 examples say “go with 1.5”, 50 say “nope, -2 is a great direction”, while the last 10 say “-5”; you get the point.
Some optimizers try to avoid that by using an adaptive learning rate for each parameter. But it’s not a perfect solution.
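As a minimal sketch of what “an adaptive learning rate for each parameter” means, here is a toy Adagrad-style update in numpy (my illustration, reusing the gradient values 1.5, -2, -5 from above as if they were three different parameters): each parameter’s step is divided by the history of its own squared gradients, so parameters with consistently large gradients get proportionally smaller steps.

```python
import numpy as np

# Adagrad-style sketch: per-parameter step size shrinks with gradient history.
lr, eps = 0.1, 1e-8
params = np.zeros(3)
grad_sq_sum = np.zeros(3)
grad = np.array([1.5, -2.0, -5.0])  # constant toy gradient per parameter

for _ in range(10):
    grad_sq_sum += grad ** 2
    params -= lr * grad / (np.sqrt(grad_sq_sum) + eps)

print(params)
```

With constant gradients the per-parameter scaling cancels the gradient magnitude entirely, so all three parameters move by (almost exactly) the same amount despite gradients differing by more than 3x; that is the sense in which adaptive methods stop a few large gradients from dominating the update.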


I don’t think there is a “go with the largest batch size you can” rule. The point of lots of VRAM is usually being able to fit the data and gradients on the GPU at all, while avoiding the problems associated with smaller batch sizes.

The more complex your network becomes, the more GPU memory it needs during backpropagation (weights, gradients, and cached activations). But deeper architectures seem to be better at “grasping” the idea behind the data.
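A rough back-of-the-envelope sketch of why depth costs memory (the layer widths here are hypothetical, just for illustration): even counting only the weights and biases of a fully connected net, a deeper stack multiplies the parameter count, and backprop also has to keep per-layer activations around on top of that.

```python
# Rough weight/bias count for a fully connected net; activations, gradients,
# and optimizer state add further GPU memory on top of this.
def mlp_param_count(layer_widths):
    # weights (w_in * w_out) plus biases (w_out) for each consecutive layer pair
    return sum(w_in * w_out + w_out
               for w_in, w_out in zip(layer_widths, layer_widths[1:]))

shallow = mlp_param_count([784, 128, 10])        # one hidden layer
deep = mlp_param_count([784, 512, 512, 512, 10]) # three wider hidden layers

print(shallow, deep)
```

The deeper net here has roughly 9x the parameters of the shallow one, before even accounting for the cached activations backprop needs, which is where the VRAM goes.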

I think that’s why it becomes important to have large amounts of VRAM when training a network.

Thank you for taking the time to respond. That cleared up my doubts.
Thank you very much :smiley: