Using Adagrad instead of SGD makes a difference

Continuing the discussion from Assignment 2: Train Your First Model:

I had the issue of the loss stagnating around 12k when using SGD. (Not sure if this is happening just for me.)

I used Adagrad as the optimizer to solve this issue. My feeling was that the learning rate needs to be high at the beginning and then taper off as the epochs progress. The loss surface doesn't seem very smooth, and plain SGD gets stuck in a local minimum or at a saddle point (or the loss function just misbehaves). Adagrad handles the tapering automatically: it scales each parameter's step by the history of its gradients, so the effective learning rate shrinks on its own as training goes on. For me the loss settles around 3k (close to the lowest reported in the forum) after switching to Adagrad. I feel Adagrad also largely removes the need to manually rerun the training loop at different learning rates.
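To make the tapering concrete: Adagrad divides the base learning rate by the square root of the accumulated squared gradients for each parameter. This is just a toy illustration of that scaling rule on a dummy gradient, not the assignment code:

import torch

lr, eps = 130.0, 1e-10
state = torch.zeros(3)   # running sum of squared gradients, one entry per parameter
w = torch.zeros(3)
for t in range(5):
    grad = torch.ones(3)                       # dummy gradient
    state += grad ** 2
    w -= lr * grad / (state.sqrt() + eps)      # Adagrad-style update
    eff = lr / (state[0].item() ** 0.5 + eps)  # effective per-parameter step size
    print(f"step {t}: effective lr ~ {eff:.1f}")

The printed effective rate starts at 130 and drops every step, which is exactly the "high at first, then taper off" behaviour I wanted.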

Code and Hyperparameters:
opt = torch.optim.Adagrad(model.parameters(), lr=lr, weight_decay=weight_decay, lr_decay=lr_decay)

batch_size = 64
epochs = 100
runs = 100
loss_func = nn.L1Loss()

Optimizer Hyperparameters

lr = 130 (it's high, but Adagrad will take care of that :slight_smile:)
momentum = 0 (not used by Adagrad)
weight_decay = 0
lr_decay = 0.1
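Putting it all together, here is a minimal end-to-end sketch of how these settings fit into a standard training loop. The toy linear model and random data below only stand in for the assignment's own model and dataset so the snippet runs as-is:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholders for the assignment's model and data, just so this runs end to end.
model = nn.Linear(10, 1)
dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
train_loader = DataLoader(dataset, batch_size=64, shuffle=True)

lr, weight_decay, lr_decay = 130, 0, 0.1
epochs = 100
loss_func = nn.L1Loss()
opt = torch.optim.Adagrad(model.parameters(), lr=lr,
                          weight_decay=weight_decay, lr_decay=lr_decay)

for epoch in range(epochs):
    running_loss = 0.0
    for xb, yb in train_loader:
        loss = loss_func(model(xb), yb)
        opt.zero_grad()
        loss.backward()
        opt.step()
        running_loss += loss.item()
    print(f"epoch {epoch}: mean L1 loss = {running_loss / len(train_loader):.3f}")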

You could also try Adam.
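For example (just a sketch; Adam also adapts per-parameter rates, but it usually wants a much smaller base learning rate than 130, something like 1e-3 to 1e-2):

# Sketch: same model, Adam instead of Adagrad, with a typical small base lr.
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0)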

Also, a learning rate above 100 is kinda weird, but if it works…

You could also look at learning-rate schedulers, which modify the LR as training progresses.
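For instance (a sketch only; StepLR is just one of several options in torch.optim.lr_scheduler, and the step_size/gamma values here are arbitrary):

# Sketch: plain SGD plus a scheduler that halves the LR every 20 epochs.
opt = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(opt, step_size=20, gamma=0.5)

for epoch in range(epochs):
    # ... usual inner training loop over batches ...
    scheduler.step()   # decay the learning rate once per epoch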