OMG, the gradient is exploding!

Chill, SGD will save us all.

This time last year, I was having a tough time reproducing the BEGAN and WGAN magic for our context-aware video prediction work (to add some ‘novelty’). Things didn’t go well, because, OMG, balancing G and D is art (witchcraft) rather than science. Gaussian in, garbage out is how my GAN works. I still remember some random details from the two papers, such as the fact that they use RMSProp because Adam does not work, and they don’t really know why… This kind of information, plus the suffering, pushed me away from the min-max world for a while. **NOW I’M BACK!**

For my thesis, Wen asked me to read Ganin’s reverse-gradient paper. This general trick for min-max problems is AMAZING!!! Without tuning any parameter, the trick works flawlessly on my own task. (Feeling like I am the King[Slave] of min-max problems.) Traditionally, you would schedule the training of the main net and the discriminators using some heuristics. That can easily take tons of hours while your model still collapses. The reverse-gradient trick finds a way to (somehow) solve the min-max problem by minimizing a single loss. So no more scheduling; everything is unified in one loss function.

$$
\begin{aligned}
\theta_f &\leftarrow \theta_f - \mu \left( \frac{\partial L_y^i}{\partial \theta_f} - \lambda \frac{\partial L_d^i}{\partial \theta_f} \right) \\
\theta_y &\leftarrow \theta_y - \mu \frac{\partial L_y^i}{\partial \theta_y} \\
\theta_d &\leftarrow \theta_d - \mu \frac{\partial L_d^i}{\partial \theta_d}
\end{aligned}
$$
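These updates fall out of ordinary backprop once you insert a layer that acts as the identity on the forward pass and multiplies the incoming gradient by −λ on the backward pass: the domain loss then pushes the feature extractor in the *opposite* direction. A minimal numpy sketch of just that layer’s behavior (the function names and λ value are mine for illustration; Ganin et al. anneal λ during training):

```python
import numpy as np

LAMBDA = 1.0  # hypothetical reversal strength; the paper anneals this over training

def grl_forward(x):
    """Gradient reversal layer: plain identity on the forward pass."""
    return x

def grl_backward(grad_output, lam=LAMBDA):
    """Backward pass: flip the sign of the gradient and scale by lambda."""
    return -lam * grad_output

# The feature extractor receives -lambda * dL_d/dtheta_f from the domain
# branch, so minimizing the single combined loss realizes the min-max update.
x = np.array([1.0, 2.0, 3.0])
print(grl_forward(x))                    # unchanged: [1. 2. 3.]
print(grl_backward(np.ones(3), lam=0.5)) # reversed:  [-0.5 -0.5 -0.5]
```

In frameworks with custom autograd ops, this is a few lines (e.g. a custom `Function`/`tf.custom_gradient`); the sketch above just makes the forward/backward asymmetry explicit.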

The original paper uses SGD with momentum for training. In the beginning, I simply followed this setting, until I noticed some weird behavior: roughly one run out of ten, the model converges to a better plateau, yielding a 5-7% performance boost, but I was not able to make this reliable by tuning batch size, learning-rate schedule, other hyperparameters, or even initializers. (A truncated normal distribution works better than Xavier for some reason in my case.)

After two days of watching the World Cup while tuning parameters, the “RMSProp works, Adam not” memory from last year encouraged me to try some other optimizers. GANHacks gives some practical guides, such as using Adam for the main net and SGD for the discriminator. Consistent with my memory, Adam did not work at first; this was resolved by increasing epsilon to a larger value (1.0). Adam + SGD narrowed the gap but could not eliminate it. To my great surprise, plain SGD without momentum gives the best performance. It seems that the acceleration from momentum makes the optimizer skip past better convergence points. This is quite an interesting observation, because back in my deep learning course, professors usually claimed that SGD with momentum generally works better.
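The momentum-overshoot intuition can be seen even on a 1-D quadratic. Below is a toy numpy sketch (step sizes are mine, chosen to exaggerate the effect, and a convex toy says nothing definitive about a min-max loss surface): with a large step size, the accumulated velocity of heavy-ball momentum oscillates around the minimum while plain SGD settles right in.

```python
import numpy as np

def sgd(grad, x0, lr=0.1, momentum=0.0, steps=50):
    """Gradient descent with optional heavy-ball momentum on a 1-D objective."""
    x, v = x0, 0.0
    for _ in range(steps):
        v = momentum * v + grad(x)  # accumulate velocity
        x = x - lr * v
    return x

grad = lambda x: x  # gradient of f(x) = 0.5 * x^2, minimum at x = 0

plain = sgd(grad, x0=5.0, lr=0.9)                # contracts by 0.1x per step
heavy = sgd(grad, x0=5.0, lr=0.9, momentum=0.9)  # velocity overshoots, oscillates

# After 50 steps, plain SGD is essentially at the minimum, while the
# momentum run is still ringing around it with a larger residual error.
print(abs(plain), abs(heavy))
```

This is only a cartoon of the behavior I saw; on the actual adversarial loss the dynamics are far messier, but it matches the observation that momentum can carry the model past the narrow, better plateau.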