{"id":5051,"date":"2023-01-12T12:48:50","date_gmt":"2023-01-12T04:48:50","guid":{"rendered":"https:\/\/fanyuzhao.com\/?p=5051"},"modified":"2023-01-12T15:38:34","modified_gmt":"2023-01-12T07:38:34","slug":"optimiser","status":"publish","type":"post","link":"https:\/\/fanyuzhao.com\/?p=5051","title":{"rendered":"Optimiser"},"content":{"rendered":"\n<p>In the Neural Network, we optimise (minimise) the loss by choosing weights matrix, <span class=\"katex math inline\">W<\/span> and <span class=\"katex math inline\">b<\/span>.<\/p>\n\n\n\n<p>$$ arg\\max_{W, b} Loss  $$<\/p>\n\n\n\n<p>By taking the F.O.C., we get the gradient, <span class=\"katex math inline\">\\nabla<\/span>. Then, we update weights by minus the learning rate times the gradient.<\/p>\n\n\n\n<p>Here, I wrote some test algorithms to find how optimisers evolve.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Preparation<\/h2>\n\n\n\n<p>Let&#8217;s make some preparations and notation clarifications.<\/p>\n\n\n\n<p><strong>Notation:<\/strong><\/p>\n\n\n\n<ul><li>Y is True,<\/li><li>T is the Estimate.<\/li><\/ul>\n\n\n\n<p>$$Y = XW + b + e$$<\/p>\n\n\n\n<p>Let <span class=\"katex math inline\">T = WX +b<\/span><\/p>\n\n\n\n<p>$$ loss = e^T e $$<\/p>\n\n\n\n<p>$$ loss = (Y-T)^T \\cdot (Y-T) $$<\/p>\n\n\n\n<p>$$= (WX+b+e-T)^T \\cdot (WX+b+e-T) $$<\/p>\n\n\n\n<p>$$ \\frac{\\partial Loss}{\\partial W} = X^T (WX+b+e-T) = 2X^T (Y-T) $$<\/p>\n\n\n\n<p>$$ \\frac{\\partial loss }{\\partial b} = 1^T (WX+b+e-T) = 1^T (Y-T)$$<\/p>\n\n\n\n<p>Just One More Thing We Consider Here!<\/p>\n\n\n\n<p>We do not want the number of sample to affect loss function&#8217;s value, so we take averge, instead of sumation.<\/p>\n\n\n\n<p>$$ \\frac{\\partial Loss}{\\partial W} = \\frac{2}{N}\\ X^T (Y-T) \\quad, \\quad \\frac{\\partial loss }{\\partial b} = \\frac{1}{N} \\ 1^T (Y-T) $$<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Optimisers<\/h2>\n\n\n\n<h4 class=\"wp-block-heading\">Gradient Descent<\/h4>\n\n\n\n<p>$$ W_t = W_{t-1} &#8211; \\eta \\nabla_W $$<\/p>\n\n\n\n<p>$$ b_t = b_{t-1} &#8211; \\eta \\nabla_b $$<\/p>\n\n\n\n<p>, where <span class=\"katex math inline\">\\eta<\/span> is the learning rate.<\/p>\n\n\n\n<p>However, we may find that the Gradient Descent might not satisfactory, because gradients might not be available or vanished sometimes. Also, the process of gradient descent largely depends on the HyperParameter <span class=\"katex math inline\">\\eta<\/span>. Some other methods of optimisation are developed.<\/p>\n\n\n\n<p>Different Optimisers aim to find the optimal weights more speedy and more accurate. In the followings are how optimisers are designed.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Stochastic Gradient Descent &#8211; SGD<\/h4>\n\n\n\n<p>In SGD, the stochastic part is added to avoid model train full dataset, and avoid resulting in the problem of <strong>overfitting<\/strong>.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-9.png\" alt=\"\" class=\"wp-image-5065\" width=\"274\" height=\"195\" srcset=\"http:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-9.png 606w, http:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-9-300x214.png 300w\" sizes=\"(max-width: 274px) 100vw, 274px\" \/><\/figure><\/div>\n\n\n<h4 class=\"wp-block-heading\">Momentum SGD<\/h4>\n\n\n\n<p>A momentum term is added.<\/p>\n\n\n\n<ul><li>Apply Momentum to the Gradient, <span class=\"katex math inline\">\\nabla<\/span>.<\/li><\/ul>\n\n\n\n<p>$$\\text{Gradient Descent:} \\quad w_t = w_{t-1} -\\eta g_w$$<\/p>\n\n\n\n<p>$$ \\text{Momentum:}\\quad v_t = \\beta_1 v_{t-1} + (1-\\beta_1)g_w $$<\/p>\n\n\n\n<p>In the Beginning of Iteration, <span class=\"katex math inline\">v_0=0<\/span>, so we <strong>amend<\/strong> it to be <span class=\"katex math inline\">\\hat{v_T}<\/span><\/p>\n\n\n\n<p>$$ \\hat{v_t} = \\frac{v_t}{1-\\beta_1^t} $$<\/p>\n\n\n\n<p>,where <span class=\"katex math inline\">\\beta_1<\/span> assign weights between previous value and the gradient.<\/p>\n\n\n\n<p>Replace the gradient <span class=\"katex math inline\">g_w<\/span> by the amended momentum term <span class=\"katex math inline\">\\hat{v}_t<\/span>:<\/p>\n\n\n\n<p>$$ w_t = w_{t-1} -\\eta \\hat{v}_t $$<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-10.png\" alt=\"\" class=\"wp-image-5067\" width=\"284\" height=\"202\" srcset=\"http:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-10.png 606w, http:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-10-300x214.png 300w\" sizes=\"(max-width: 284px) 100vw, 284px\" \/><\/figure><\/div>\n\n\n<p>P.S. Why we amend <span class=\"katex math inline\">v_t<\/span> by dividing <span class=\"katex math inline\">1 + \\beta^t<\/span>.<\/p>\n\n\n\n<p>As <span class=\"katex math inline\">v_t = \\beta_1 v_{t-1} + (1-\\beta_1)g_w<\/span>, which is like a geometric decaying polynomial function. The difference is that <span class=\"katex math inline\">w<\/span> and <span class=\"katex math inline\">g_w<\/span> keep updating each period. Let&#8217;s assume there is no updating anymore, or in other word, model has converge. <span class=\"katex math inline\">g_w = constant<\/span>. <span class=\"katex math inline\">\\{ v_t \\}_{t=0}^T<\/span> would be,<\/p>\n\n\n\n<p>$$ v_0 = 0 $$<\/p>\n\n\n\n<p>$$ v_1 = (1-\\beta_1)g_w $$<\/p>\n\n\n\n<p>$$ v_2 = \\beta_1 (1-\\beta_1)g_w + (1-\\beta_1) g_w $$<\/p>\n\n\n\n<p>$$ v_3 = \\beta_1^2(1-\\beta_1)g_w + \\beta_1 (1-\\beta_1)g_w + (1-\\beta_1)g_w $$<\/p>\n\n\n\n<p>$$ v_n = \\beta_1^{n-1}(1-\\beta_1)g_w + &#8230; + (1-\\beta_1)g_w $$<\/p>\n\n\n\n<p>$$ v_n = (1-\\beta_1)g_w (1+\\beta_1 + &#8230; +\\beta_1^{n-1})  $$<\/p>\n\n\n\n<p>$$ v_n = (1-\\beta_1)g_w \\frac{1 &#8211; \\beta_1^n}{1-\\beta_1} =g_w (1-\\beta_1^n)$$<\/p>\n\n\n\n<p>Clearly, to amend <span class=\"katex math inline\">v_n<\/span> be like <span class=\"katex math inline\">g_w<\/span>, we need to divide it by <span class=\"katex math inline\">1-\\beta_1^n<\/span>, because of the polynomial  &#8216;geometric&#8217; form.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">RMSprop<\/h4>\n\n\n\n<ul><li>Apply a Transformation to the Gradient, <span class=\"katex math inline\">\\nabla<\/span>.<\/li><\/ul>\n\n\n\n<p>$$\\text{Gradient Descent:} \\quad w_t = w_{t-1} -\\eta g_w$$<\/p>\n\n\n\n<p>$$ \\text{RMS:}\\quad m_t = \\beta_1 v_{t-1} + (1-\\beta_2)g^2_w $$<\/p>\n\n\n\n<p>In the Beginning of Iteration, <span class=\"katex math inline\">v_0=0<\/span>, so we amend it to be <span class=\"katex math inline\">\\hat{v_T}<\/span>, (remember there is a power <span class=\"katex math inline\">t<\/span> here).<\/p>\n\n\n\n<p>$$ \\hat{m_t} = \\frac{m_t}{1-\\beta_1^t} $$<\/p>\n\n\n\n<p>,where <span class=\"katex math inline\">\\beta_1<\/span> assign weights between previous value and the gradient.<\/p>\n\n\n\n<p>Add a square root of <span class=\"katex math inline\">\\hat{m}_t<\/span> in the denominator:<\/p>\n\n\n\n<p>$$ w_t = w_{t-1} -\\eta \\frac{g_w}{\\sqrt{\\hat{m}_t}} $$<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-11.png\" alt=\"\" class=\"wp-image-5069\" width=\"306\" height=\"217\" srcset=\"http:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-11.png 606w, http:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-11-300x214.png 300w\" sizes=\"(max-width: 306px) 100vw, 306px\" \/><\/figure><\/div>\n\n\n<h4 class=\"wp-block-heading\">Adam<\/h4>\n\n\n\n<p>Adam is nearly the most powerful optimiser so far. Adam is like a combination of Momentum and RMS, both <span class=\"katex math inline\">\\hat{m}<\/span> and <span class=\"katex math inline\">\\hat{v}<\/span> are added to adjust the speed of approaching to the optimal points, <span class=\"katex math inline\">W^*<\/span>, and <span class=\"katex math inline\">b^*<\/span>. <\/p>\n\n\n\n<p>$$ w_t = w_{t-1} &#8211; \\eta \\cdot \\frac{ \\hat{v}_t }{\\sqrt{\\hat{m}_t}} $$<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-16.png\" alt=\"\" class=\"wp-image-5101\" width=\"264\" height=\"188\" srcset=\"http:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-16.png 606w, http:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-16-300x214.png 300w\" sizes=\"(max-width: 264px) 100vw, 264px\" \/><\/figure><\/div>\n\n\n<p>Three main hyper parameters are <span class=\"katex math inline\">\\eta = 0.01<\/span>, <span class=\"katex math inline\">\\beta_1 = 0.9<\/span>, and <span class=\"katex math inline\">\\beta_2 = 0.999<\/span>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Code<\/h2>\n\n\n\n<div class=\"wp-block-file\"><a id=\"wp-block-file--media-01e3422e-a854-4e68-8d5a-a01c3c42d4da\" href=\"https:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/Optimiser.html\">Optimiser<\/a><a href=\"https:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/Optimiser.html\" class=\"wp-block-file__button\" download aria-describedby=\"wp-block-file--media-01e3422e-a854-4e68-8d5a-a01c3c42d4da\">Download<\/a><\/div>\n","protected":false},"excerpt":{"rendered":"<p>In the Neural Network, we optimise (minimise) the loss by choosing weights matrix, W and b. $$ arg\\max_{W, b} Loss $$ By taking the F.O.C., we get the gradient, \\nabla. Then, we update weights by minus the learning rate times the gradient. Here, I wrote some test algorithms to find how optimisers evolve. Preparation Let&#8217;s &hellip; <a href=\"https:\/\/fanyuzhao.com\/?p=5051\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Optimiser<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[6,8],"tags":[],"_links":{"self":[{"href":"https:\/\/fanyuzhao.com\/index.php?rest_route=\/wp\/v2\/posts\/5051"}],"collection":[{"href":"https:\/\/fanyuzhao.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fanyuzhao.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/fanyuzhao.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/fanyuzhao.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=5051"}],"version-history":[{"count":49,"href":"https:\/\/fanyuzhao.com\/index.php?rest_route=\/wp\/v2\/posts\/5051\/revisions"}],"predecessor-version":[{"id":5123,"href":"https:\/\/fanyuzhao.com\/index.php?rest_route=\/wp\/v2\/posts\/5051\/revisions\/5123"}],"wp:attachment":[{"href":"https:\/\/fanyuzhao.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=5051"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fanyuzhao.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=5051"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fanyuzhao.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=5051"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}