Some questions about Data augmentation, Batch Normalization & ResNets

@aakashns I have a lot of questions. Thanks for your patience with a newbie; hopefully answering some of these will help other students who come across this thread.

  • Is it useful to augment images and add those copies to the dataset or would the duplicates not be helpful?

  • Is there any literature or research you can point to that explains the math or reasoning behind residual blocks that do g(f(x)) + x vs. g(f(x) + x)?

  • Is there a reason we add f(x) and x together rather than subtracting them? I’m also imagining that you could apply other functions to learn relationships between f(x) and x.

  • Is there a reason to do two conv layers in the residual block instead of one?

  • In a perfect world we wouldn’t need to do batches, right? If we processed all images in one batch, would you still want to do batch normalization for the one batch?

  • Would it make sense to process all the batches first and then compute the normalization for a given layer across all batches at once, based on their summary stats?

  • Is it best practice to build out the feature map to a 4x4 matrix? Is it just because you can’t max pool a 2x2 matrix, since it is already as small as possible?

@bradhanks Thanks for the great questions!

We are already augmenting images randomly while training, so there’s little additional benefit gained by adding augmented copies to the dataset. One place where it can help is if there are very few images for a specific class. Adding copies (augmented or otherwise) for specific classes is called oversampling.
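A rough sketch of what oversampling can look like in PyTorch, using `WeightedRandomSampler` to draw rare classes more often instead of duplicating images on disk (the toy dataset, batch size, and weighting scheme below are just one common approach, not something from the lecture):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy stand-in for an image dataset: 100 samples of class 0, only 10 of class 1.
images = torch.randn(110, 3, 32, 32)
targets = torch.cat([torch.zeros(100, dtype=torch.long), torch.ones(10, dtype=torch.long)])
train_ds = TensorDataset(images, targets)

# Weight each sample inversely to its class frequency, so the rare class is
# sampled more often (with replacement) instead of being copied into the dataset.
class_counts = torch.bincount(targets)
sample_weights = 1.0 / class_counts[targets].float()

sampler = WeightedRandomSampler(sample_weights, num_samples=len(targets), replacement=True)
train_dl = DataLoader(train_ds, batch_size=32, sampler=sampler)

xb, yb = next(iter(train_dl))
print(yb.float().mean())   # roughly 0.5: both classes now appear about equally often
```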

On the reasoning behind the different residual formulations: yes, see this paper (Identity Mappings in Deep Residual Networks): https://arxiv.org/pdf/1603.05027.pdf

The choice of two conv layers is also explored in the paper linked above; they seem to have found that 2 conv layers work better than 1.
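For concreteness, here is a minimal sketch of a standard residual block with two conv layers, where the skip connection is added outside the conv stack (the g(f(x)) + x form). The channel counts and layer choices are illustrative, not taken from the lecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Two conv layers plus an identity skip: out = relu(g(f(x)) + x)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))   # first conv layer: f(x)
        out = self.bn2(self.conv2(out))         # second conv layer, no ReLU yet: g(f(x))
        return F.relu(out + x)                  # the skip is added outside the conv stack

x = torch.randn(8, 64, 16, 16)
print(ResidualBlock(64)(x).shape)   # same shape as the input: torch.Size([8, 64, 16, 16])
```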

You would still want to do batch normalization, to prevent intermediate outputs from any layer from becoming too large.
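To make that concrete, here is a rough sketch of the core computation batch norm performs within a single forward pass (leaving out the learnable scale/shift and running averages): even one very large batch gets its per-channel activations rescaled to a well-behaved range.

```python
import torch

def batch_norm_core(x, eps=1e-5):
    # x: activations of shape (batch, channels, height, width)
    mean = x.mean(dim=(0, 2, 3), keepdim=True)                # per-channel mean over the batch
    var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)  # per-channel variance over the batch
    return (x - mean) / torch.sqrt(var + eps)                 # zero mean, unit variance per channel

x = torch.randn(64, 16, 8, 8) * 50 + 100                  # artificially large activations
print(x.abs().max(), batch_norm_core(x).abs().max())       # huge values before, modest values after
```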

Batch normalization is a little more involved; my explanation in the lecture was a simplification. See https://towardsdatascience.com/batch-normalization-in-neural-networks-1ac91516821c
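As a small illustration of the extra moving parts (using PyTorch's built-in `nn.BatchNorm2d`): each layer keeps learnable per-channel scale/shift parameters and running averages of the batch statistics, which is why normalizing everything once at the end from summary stats wouldn't behave quite the same as normalizing batch by batch during training.

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(16)              # has a learnable scale (weight) and shift (bias) per channel

bn.train()
for _ in range(10):
    xb = torch.randn(32, 16, 8, 8)
    bn(xb)                           # normalized with THIS batch's mean/var; running stats updated

bn.eval()
out = bn(torch.randn(32, 16, 8, 8))  # now normalized with the accumulated running mean/var instead
print(bn.running_mean.shape, bn.weight.shape, bn.bias.shape)  # all per-channel: torch.Size([16])
```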

As for building the feature map out to 4x4: modern ResNets use something called adaptive pooling, which can reduce a feature map of any size (2x2, 4x4, 8x8) to a single element per channel. This allows them to take images of any size as input. The best practice would be to make your model independent of the size of the feature map.
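A minimal sketch of what that looks like in PyTorch (the channel and class counts are made up for illustration):

```python
import torch
import torch.nn as nn

head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),   # any HxW feature map -> 1x1 per channel
    nn.Flatten(),
    nn.Linear(512, 10),        # 512 channels and 10 classes are just illustrative numbers
)

for size in (2, 4, 8):
    feats = torch.randn(1, 512, size, size)
    print(head(feats).shape)   # torch.Size([1, 10]) regardless of the spatial size
```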
