Optimiser

In a neural network, we optimise (minimise) the loss by choosing the weight matrix $W$ and the bias $b$.

$$\arg\min_{W,\,b} \text{Loss}$$

By taking the first-order condition (F.O.C.), we get the gradient, $\nabla$. Then, we update the weights by subtracting the learning rate times the gradient.

Here, I wrote some test algorithms to see how the optimisers evolve.

Preparation

Let’s make some preparations and notation clarifications.

Notation:

  • $Y$ is the true value,
  • $T$ is the estimate.

$$Y = XW + b + e$$

Let $T = XW + b$.

$$\text{Loss} = e^T e$$

$$\text{Loss} = (Y - T)^T (Y - T)$$

$$= (XW + b + e - T)^T (XW + b + e - T)$$

$$\frac{\partial \text{Loss}}{\partial W} = -2X^T(XW + b + e - T) = -2X^T(Y - T)$$

$$\frac{\partial \text{Loss}}{\partial b} = -2\,\mathbf{1}^T(XW + b + e - T) = -2\,\mathbf{1}^T(Y - T)$$

Just One More Thing We Consider Here!

We do not want the number of samples to affect the loss function's value, so we take the average instead of the sum.

$$\frac{\partial \text{Loss}}{\partial W} = -\frac{2}{N}\,X^T(Y - T), \qquad \frac{\partial \text{Loss}}{\partial b} = -\frac{2}{N}\,\mathbf{1}^T(Y - T)$$
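
To make this concrete, here is a minimal NumPy sketch of the averaged loss and gradients above. The synthetic data (`X`, `Y`, the shapes and the noise scale) and the helper name `loss_and_grads` are my own assumptions for illustration; the later sketches reuse this helper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up synthetic data: N samples, 3 features
N = 100
X = rng.normal(size=(N, 3))
W_true = np.array([[1.5], [-2.0], [0.5]])
b_true = 0.3
e = rng.normal(scale=0.1, size=(N, 1))
Y = X @ W_true + b_true + e              # Y = XW + b + e

def loss_and_grads(W, b, X=X, Y=Y):
    """Averaged loss and its gradients with respect to W and b."""
    T = X @ W + b                        # estimate T = XW + b
    resid = Y - T                        # Y - T
    n = X.shape[0]
    loss = (resid ** 2).sum() / n        # (1/N) (Y - T)^T (Y - T)
    grad_W = -2.0 / n * X.T @ resid      # -(2/N) X^T (Y - T)
    grad_b = -2.0 / n * resid.sum()      # -(2/N) 1^T (Y - T)
    return loss, grad_W, grad_b
```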

Optimisers

Gradient Descent

$$W_t = W_{t-1} - \eta\,\nabla_W$$

$$b_t = b_{t-1} - \eta\,\nabla_b$$

where $\eta$ is the learning rate.
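
A minimal sketch of this update rule, reusing the synthetic data and the `loss_and_grads` helper from the preparation sketch above; the learning rate value and the number of steps are arbitrary choices.

```python
# Plain gradient descent, reusing loss_and_grads from the sketch above
eta = 0.1                                # learning rate η (arbitrary choice)
W, b = np.zeros((3, 1)), 0.0

for t in range(500):
    loss, grad_W, grad_b = loss_and_grads(W, b)
    W = W - eta * grad_W                 # W_t = W_{t-1} - η ∇_W
    b = b - eta * grad_b                 # b_t = b_{t-1} - η ∇_b

print(loss, W.ravel(), b)                # W and b should approach W_true and b_true
```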

However, we may find that gradient descent is not always satisfactory, because gradients are sometimes unavailable or vanish. Also, the progress of gradient descent depends heavily on the hyperparameter $\eta$. Several other optimisation methods have therefore been developed.

Different optimisers aim to find the optimal weights more quickly and more accurately. The following sections show how these optimisers are designed.

Stochastic Gradient Descent – SGD

In SGD, the stochastic part is added so that the model does not have to train on the full dataset at every step; gradients are instead computed on randomly sampled mini-batches, which also helps to avoid overfitting.
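
A sketch of the mini-batch idea, again reusing the data and the `loss_and_grads` helper from the preparation sketch; the batch size, learning rate, and step count are arbitrary choices.

```python
# Mini-batch SGD: each step uses a random subset instead of the full dataset
eta, batch_size = 0.05, 16
W, b = np.zeros((3, 1)), 0.0

for t in range(2000):
    idx = rng.choice(N, size=batch_size, replace=False)  # random mini-batch
    Xb, Yb = X[idx], Y[idx]
    _, grad_W, grad_b = loss_and_grads(W, b, Xb, Yb)     # gradient on the batch only
    W, b = W - eta * grad_W, b - eta * grad_b
```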

Momentum SGD

A momentum term is added.

  • Apply momentum to the gradient, $\nabla$.

$$\text{Gradient Descent:}\quad w_t = w_{t-1} - \eta\, g_w$$

$$\text{Momentum:}\quad v_t = \beta_1 v_{t-1} + (1-\beta_1)\, g_w$$

At the beginning of the iteration, $v_0 = 0$, so we amend it to the bias-corrected $\hat{v}_t$:

$$\hat{v}_t = \frac{v_t}{1 - \beta_1^t}$$

where $\beta_1$ assigns the weights between the previous value and the current gradient.

Replace the gradient $g_w$ with the amended momentum term $\hat{v}_t$:

$$w_t = w_{t-1} - \eta\, \hat{v}_t$$
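
A minimal sketch of the momentum update with bias correction, reusing `loss_and_grads` from the preparation sketch; momentum is applied to $W$ only to keep it short, and $\beta_1 = 0.9$ is an assumed value.

```python
# Momentum SGD with bias correction (shown for W only, for brevity)
eta, beta1 = 0.1, 0.9
W, b = np.zeros((3, 1)), 0.0
v = np.zeros_like(W)                          # v_0 = 0

for t in range(1, 501):                       # t starts at 1 for the 1 - β1^t term
    _, grad_W, grad_b = loss_and_grads(W, b)
    v = beta1 * v + (1 - beta1) * grad_W      # v_t = β1 v_{t-1} + (1 - β1) g_w
    v_hat = v / (1 - beta1 ** t)              # bias correction: v̂_t = v_t / (1 - β1^t)
    W = W - eta * v_hat                       # w_t = w_{t-1} - η v̂_t
    b = b - eta * grad_b                      # plain update for b, for brevity
```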

P.S. Why do we amend $v_t$ by dividing by $1 - \beta_1^t$?

Since $v_t = \beta_1 v_{t-1} + (1-\beta_1) g_w$, the recursion behaves like a geometrically decaying sum. The difference is that $w$ and $g_w$ keep updating each period. Let's assume there is no more updating, in other words the model has converged, so $g_w$ is constant. The sequence $\{v_t\}_{t=0}^T$ would then be:

$$v_0 = 0$$

$$v_1 = (1-\beta_1)\, g_w$$

$$v_2 = \beta_1(1-\beta_1)\, g_w + (1-\beta_1)\, g_w$$

$$v_3 = \beta_1^2(1-\beta_1)\, g_w + \beta_1(1-\beta_1)\, g_w + (1-\beta_1)\, g_w$$

$$v_n = \beta_1^{n-1}(1-\beta_1)\, g_w + \cdots + (1-\beta_1)\, g_w$$

$$v_n = (1-\beta_1)\, g_w \left(1 + \beta_1 + \cdots + \beta_1^{n-1}\right)$$

$$v_n = (1-\beta_1)\, g_w \frac{1-\beta_1^n}{1-\beta_1} = g_w\left(1-\beta_1^n\right)$$

Clearly, to amend $v_n$ so that it matches $g_w$, we need to divide it by $1-\beta_1^n$, because of the geometric form of the sum.
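
A quick numerical check of this argument under the same constant-gradient assumption: $v_n$ tracks $g_w(1-\beta_1^n)$, and dividing by $1-\beta_1^n$ recovers $g_w$. The values of $\beta_1$ and $g_w$ below are arbitrary.

```python
beta1, g_w = 0.9, 2.0             # a constant gradient, as assumed above
v = 0.0
for n in range(1, 11):
    v = beta1 * v + (1 - beta1) * g_w
    # v equals g_w (1 - β1^n); dividing by (1 - β1^n) recovers g_w = 2.0
    print(n, round(v, 4), round(g_w * (1 - beta1 ** n), 4), round(v / (1 - beta1 ** n), 4))
```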

RMSprop

  • Apply a transformation to the gradient, $\nabla$.

$$\text{Gradient Descent:}\quad w_t = w_{t-1} - \eta\, g_w$$

$$\text{RMS:}\quad m_t = \beta_2 m_{t-1} + (1-\beta_2)\, g_w^2$$

At the beginning of the iteration, $m_0 = 0$, so we amend it to $\hat{m}_t$ (remember there is a power $t$ here):

$$\hat{m}_t = \frac{m_t}{1 - \beta_2^t}$$

where $\beta_2$ assigns the weights between the previous value and the squared gradient.

Add the square root of $\hat{m}_t$ in the denominator:

$$w_t = w_{t-1} - \eta\, \frac{g_w}{\sqrt{\hat{m}_t}}$$
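
A minimal RMSprop sketch in the same setting, again reusing `loss_and_grads`. The small $\epsilon$ added to the denominator is a standard safeguard against division by zero that is not shown in the formula above, and the hyperparameter values are assumptions.

```python
# RMSprop with the bias correction above (shown for W only)
eta, beta2, eps = 0.01, 0.999, 1e-8
W, b = np.zeros((3, 1)), 0.0
m = np.zeros_like(W)                              # m_0 = 0

for t in range(1, 2001):
    _, grad_W, grad_b = loss_and_grads(W, b)
    m = beta2 * m + (1 - beta2) * grad_W ** 2     # m_t = β2 m_{t-1} + (1 - β2) g_w²
    m_hat = m / (1 - beta2 ** t)                  # bias correction: m̂_t = m_t / (1 - β2^t)
    W = W - eta * grad_W / (np.sqrt(m_hat) + eps) # divide the step by √m̂_t
    b = b - eta * grad_b
```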

Adam

Adam is arguably the most powerful optimiser so far. Adam is a combination of Momentum and RMSprop: both $\hat{v}$ (the bias-corrected momentum) and $\hat{m}$ (the bias-corrected squared-gradient average) are used to adjust the speed of approaching the optimal points $W^*$ and $b^*$.

$$w_t = w_{t-1} - \eta\, \frac{\hat{v}_t}{\sqrt{\hat{m}_t}}$$

The three main hyperparameters are $\eta = 0.01$, $\beta_1 = 0.9$, and $\beta_2 = 0.999$.
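
Putting the two pieces together gives a minimal Adam sketch in the notation above ($\hat{v}$ for the momentum term, $\hat{m}$ for the RMS term), once more reusing `loss_and_grads`; the $\epsilon$ term is the same added safeguard as in the RMSprop sketch.

```python
# Adam: momentum term v̂ in the numerator, RMS term √m̂ in the denominator
eta, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
W, b = np.zeros((3, 1)), 0.0
v, m = np.zeros_like(W), np.zeros_like(W)

for t in range(1, 2001):
    _, grad_W, grad_b = loss_and_grads(W, b)
    v = beta1 * v + (1 - beta1) * grad_W          # momentum (first moment)
    m = beta2 * m + (1 - beta2) * grad_W ** 2     # RMS (second moment)
    v_hat = v / (1 - beta1 ** t)                  # bias corrections
    m_hat = m / (1 - beta2 ** t)
    W = W - eta * v_hat / (np.sqrt(m_hat) + eps)  # w_t = w_{t-1} - η v̂_t / √m̂_t
    b = b - eta * grad_b
```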

Code