The weight update depends on the optimizer used, but usually the weights are updated after each batch is processed:
for inputs, targets in dataset:              # acquire a batch
    optimizer.zero_grad()                    # set grads to 0
    outputs = net(inputs)                    # get network response
    loss = loss_function(outputs, targets)   # calculate loss for the given outputs and targets
    loss.backward()                          # backpropagate the error
    optimizer.step()                         # <--- here the actual weight update happens
An epoch is a different term here: it means one full pass over the entire training set. I think you actually mean training steps (one weight update per step).
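A quick sketch of that relationship (the dataset size and batch size here are made-up numbers, just for illustration):

import math

num_examples = 50_000                                   # training-set size (assumed)
batch_size = 32                                         # examples per weight update (assumed)
steps_per_epoch = math.ceil(num_examples / batch_size)  # weight updates in one epoch
print(steps_per_epoch)                                  # 1563 steps per full pass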
The batch size is a number you have to "feel out" on your own. Sometimes you'll fiddle with this hyperparameter a lot, sometimes you'll get it right the first time. The network architecture may also play a crucial role here.
The smaller the batch size, the noisier the weight updates become, because fewer examples participate in "voting" for the best direction to move. Training might not converge at all, instead circling around a minimum (local or global).
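You can see this averaging effect in a toy simulation (pure NumPy, all numbers made up): noisy per-example "gradients" with the same underlying mean get averaged over batches of different sizes, and the spread of the resulting updates shrinks as the batch grows:

import numpy as np

rng = np.random.default_rng(0)
grads = 1.5 + rng.normal(scale=4.0, size=8192)    # noisy per-example "gradients", true mean 1.5

for batch_size in (4, 64, 1024):
    batches = grads.reshape(-1, batch_size)       # 8192 is divisible by all three sizes
    updates = batches.mean(axis=1)                # one averaged update per batch
    print(batch_size, updates.std())              # spread shrinks roughly as 1/sqrt(batch_size)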
As you correctly point out, the larger the batch size, the slower the convergence may get. That's probably because the weight updates get close to 0 when the per-example gradients cancel each other out (100 examples say "go with 1.5", 50 say "nope, -2 is a great direction", while the last 10 say "-5"; you get the point).
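With exactly those votes, the averaged update is literally zero:

votes = [1.5] * 100 + [-2.0] * 50 + [-5.0] * 10   # the three camps from above
update = sum(votes) / len(votes)                  # what averaging over the batch does
print(update)                                     # 0.0 -- the batch pulls in no direction at all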
Some optimizers try to avoid that by maintaining an adaptive learning rate for each parameter. But it's not a perfect solution.
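Adam is one well-known example; in PyTorch it's a drop-in replacement for the optimizer in the loop above (the model here is just a stand-in):

import torch
import torch.nn as nn

net = nn.Linear(10, 2)                                    # placeholder model (assumed)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)   # keeps per-parameter moment estimates
                                                          # and scales each parameter's step individually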
I think there isn’t a “go largest batch size you can” statement. The problem with VRAM is usually fitting the data and grads into the GPU, while avoiding the problems associated with smaller batch sizes.
The more complex your network becomes, the more GPU memory it will need during backpropagation, since the intermediate activations have to be kept around for the backward pass. At the same time, deeper architectures seem to be better at "grasping" the idea behind the data. I think that's why large amounts of VRAM become important when training a network.