If we're using batch norm, the weights won't grow too large, right? Then why do we do gradient clipping?
Correction: Batch norm is applied to layer outputs, not weights; it's weight decay that prevents weights from becoming too large. Regardless, the purpose of gradient clipping is to prevent sudden, undesired changes in the weights caused by large gradients.