gradient clipping, weight decay and dropout

Hi all, please enlighten me on the following concepts:

  1. gradient clipping is used for what purpose? what would be a good value for it? smaller or bigger values?

  2. what is weight decay? when should it be used? what would be good values for training? are bigger or smaller values better?

  3. is random dropout only used before a fully connected layer? if I have several fully connected layers, would it make sense to have random dropout for each layer with different values?

  4. how to obtain the ImageNet images for training/validation? I see that torchvision.datasets has several datasets available for training such as CIFAR10, CIFAR100, and COCO, but the ImageNet one is just a wrapper to handle images from ImageNet, similar to ImageFolder, not an image dataset itself… so I’m wondering, how do those researchers train their models with images from ImageNet if there’s no standardized way of getting a fixed dataset that all researchers can assess their accuracy with?

  5. how do I use SRNET to scale a non-square photo? from what I read, SENET/SR ResNet cater only to square images, and I think only 224x224. what would be the recommended way to scale a, say, 16:9 photo? My impression is that as long as the pixel counts match the network’s requirement, it should be able to make it to the final layer for scaling, is that right?

  6. How do I speed up training on Windows? I dual-booted my PC between Windows 10 and Linux Mint 19 (different SSDs, so they don’t know of each other’s presence). I have installed Anaconda and PyTorch on both, and I noticed that when training the same exact model using the same exact code and dataset, Windows is about 3x slower than Linux. From the Task Manager’s performance tab, I can also see that the GPU is not being utilized as much as it was on Linux. I have tried installing the CUDA toolkit to no avail; it’s still as slow… has anyone experienced this? I trained using a Titan X (Maxwell core), so the GPU has enough power to train.

  7. is PyTorch able to make use of the tensor cores in the RTX 20xx and 30xx series of GPUs? would it actually make a difference whether tensor cores are present?

please enlighten me on the above, thank you very much for your time :slight_smile:

  1. It’s to avoid gradients becoming overly large (exploding gradients), which can destabilize training. Good values really depend on the model; I’ve seen values below 0.5 in most cases.
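A minimal sketch of norm-based clipping with `torch.nn.utils.clip_grad_norm_`; the tiny model, random data, and `max_norm=0.5` here are just placeholders:

```python
import torch
import torch.nn as nn

# Tiny illustrative model; the sizes and max_norm value are arbitrary.
model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 10)
y = torch.randn(8, 1)

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()

# Rescale all gradients so their combined norm is at most max_norm,
# then take the optimizer step on the clipped gradients.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
optimizer.step()
```

Clipping happens between `backward()` and `step()`, since it modifies the `.grad` buffers in place.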
  2. Weight decay is used to prevent the model from overfitting to the training data. The implementation varies: sometimes it’s added as a term in the loss function, other times it’s built into the optimizer (which probably accounts for it somehow as well). Again, the value depends (you’re gonna hate this word). Usually I’ve used values smaller than the LR.
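In PyTorch the optimizer route is just an argument: SGD adds `weight_decay * p` to each parameter’s gradient, which is the L2 penalty in disguise. The `1e-4` below is only a common starting point, not a recommendation for any particular model:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
# weight_decay=1e-4 is an example value, typically smaller than the LR.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-4)

# With a zero loss gradient, a step shrinks the weights by a factor
# of (1 - lr * weight_decay) -- that's the decay acting on its own.
w0 = model.weight.detach().clone()
model.weight.grad = torch.zeros_like(model.weight)
model.bias.grad = torch.zeros_like(model.bias)
optimizer.step()
```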
  3. It can be used everywhere; what you have in mind is probably the dropout between the conv and FC layers. It’s mainly used to prevent overfitting (the model can’t rely on a single output from the previous layer, it has to learn that some other features are important too). Using different dropout probabilities for different layers is perfectly fine.
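For instance, a stack of FC layers with a different (arbitrary, illustrative) dropout probability after each one; note that `model.eval()` disables dropout at inference time:

```python
import torch
import torch.nn as nn

# Example dropout probabilities per layer; the values are arbitrary.
model = nn.Sequential(
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # heavier dropout on the wider layer
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Dropout(p=0.2),   # lighter dropout deeper in the network
    nn.Linear(64, 10),
)

model.eval()  # dropout becomes a no-op in eval mode
x = torch.randn(4, 256)
out = model(x)
```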
  4. I looked at the ImageNet class and it seems it should be enough. Note that the ILSVRC2012 archives have to be downloaded manually from image-net.org (registration required); torchvision then parses and organizes them for you. That fixed train/val split is the standardized dataset researchers benchmark their accuracy against.
  5. I would split the image into squares (if the model allows it) and then concatenate the results somehow (it really depends on the model). I’ve written about the unfold function in my notebook during the previous course; you might want to check that.
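A sketch of the square-splitting idea with `Tensor.unfold` (the 720p frame and 224x224 patch size are example numbers; edge pixels that don’t divide evenly are simply dropped here, whereas padding or overlapping strides would keep them):

```python
import torch

# A fake 16:9 image tensor: [channels, height, width].
img = torch.randn(3, 720, 1280)

# Slide a 224x224 window with a 224 stride over height then width,
# producing non-overlapping square patches.
patches = img.unfold(1, 224, 224).unfold(2, 224, 224)
# patches: [3, n_rows, n_cols, 224, 224]

# Flatten the grid into a batch of square tiles the model can consume.
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3, 224, 224)
```

With a 720x1280 input this yields a 3x5 grid, i.e. 15 tiles; stitching the per-tile outputs back together is the model-dependent part.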
  6. No idea about that; you might want to check whether CUDA is available there with torch.cuda.is_available(). Perhaps it’s something with the setup.
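A quick sanity-check snippet along those lines, which also confirms tensors actually land on the GPU when one is visible (it falls back to CPU otherwise):

```python
import torch

# Check that the GPU build of PyTorch is active on this install.
print(torch.cuda.is_available())   # True if a usable GPU was found
print(torch.version.cuda)          # CUDA version PyTorch was built with (None on CPU builds)

if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

# Allocate a tensor on whichever device was selected.
x = torch.randn(2, 2, device=device)
```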
  7. TPUs really matter only when you’re able to parallelize the model. There’s a library called pytorch-xla (or something like that) that in theory makes it possible to use a TPU, but I’ve had no luck using it on Kaggle. You might check it locally, though. Otherwise, Keras has TPU support out of the box, I think (never tried that, tbh).
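On the tensor-core side of the original question (NVIDIA RTX GPUs rather than TPUs), my understanding is that they are engaged mainly through mixed precision. A minimal, hedged sketch with `torch.autocast`; the layer sizes are arbitrary, and the snippet falls back to bfloat16 on CPU so it runs anywhere:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(128, 64).to(device)
x = torch.randn(16, 128, device=device)

# autocast runs matmul-heavy ops in reduced precision where it is safe;
# on RTX-class GPUs this is what routes the work onto the tensor cores.
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16
with torch.autocast(device_type=device, dtype=amp_dtype):
    out = model(x)
```

For full training loops, a gradient scaler is usually paired with fp16 autocast; the snippet above only shows the forward pass.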

Whew, a lot of questions :stuck_out_tongue:

thank you very much for taking the time and patience to reply.

For Windows training, yes, I’ve checked that CUDA is available using torch.cuda.is_available() in the Windows console… but it’s still slow. no idea why…

I’ll delve further into ImageNet and super resolution.

thank you very much :slight_smile: :smiley: