{"id":4880,"date":"2023-01-11T11:07:48","date_gmt":"2023-01-11T03:07:48","guid":{"rendered":"https:\/\/fanyuzhao.com\/?p=4880"},"modified":"2023-01-12T10:11:02","modified_gmt":"2023-01-12T02:11:02","slug":"deep-learning-from-scratch","status":"publish","type":"post","link":"https:\/\/fanyuzhao.com\/?p=4880","title":{"rendered":"Deep Learning from Scratch"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">1. Perceptron<\/h2>\n\n\n\n<h4 class=\"wp-block-heading\">1.1 AND &amp; OR &amp; NAND Gates<\/h4>\n\n\n\n<p>A single perceptron is a linear classifier, so it can implement linearly separable gates such as AND, NAND and OR.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import numpy as np\n\ndef AND(x1, x2):\n    x = np.array(&#91;x1, x2])\n    w = np.array(&#91;0.5, 0.5])\n    b = -0.7\n    temp = np.sum(w*x) + b  # weighted sum plus bias\n    if temp &lt;= 0:\n        return 0\n    else:\n        return 1\n\ndef NAND(x1, x2):\n    # same form as AND, with the signs of w and b flipped\n    x = np.array(&#91;x1, x2])\n    w = np.array(&#91;-0.5, -0.5])\n    b = 0.7\n    temp = np.sum(w*x) + b\n    if temp &lt;= 0:\n        return 0\n    else:\n        return 1\n\ndef Nand(x1, x2):  # same result as NAND, built by negating AND\n    return int(not bool(AND(x1, x2)))\n\ndef OR(x1, x2):\n    # same weights as AND, with a bias closer to zero\n    x = np.array(&#91;x1, x2])\n    w = np.array(&#91;0.5, 0.5])\n    b = -0.2\n    temp = np.sum(w*x) + b\n    if temp &lt;= 0:\n        return 0\n    else:\n        return 1<\/code><\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">1.2 XOR Gate &#8211; Apply More than Two Perceptrons to Achieve <strong>Non-Linear Classification<\/strong><\/h4>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-1024x161.png\" alt=\"\" class=\"wp-image-4889\" width=\"489\" height=\"77\" srcset=\"http:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-1024x161.png 1024w, http:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-300x47.png 300w, 
http:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-768x121.png 768w, http:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image.png 1082w\" sizes=\"(max-width: 489px) 100vw, 489px\" \/><\/figure><\/div>\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-1.png\" alt=\"\" class=\"wp-image-4890\" width=\"531\" height=\"105\" srcset=\"http:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-1.png 976w, http:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-1-300x60.png 300w, http:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-1-768x153.png 768w\" sizes=\"(max-width: 531px) 100vw, 531px\" \/><\/figure><\/div>\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-2.png\" alt=\"\" class=\"wp-image-4891\" width=\"287\" height=\"181\" srcset=\"http:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-2.png 590w, http:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-2-300x189.png 300w\" sizes=\"(max-width: 287px) 100vw, 287px\" \/><\/figure><\/div>\n\n\n<pre class=\"wp-block-code\"><code>def XOR(x1, x2):\n    # first layer: two perceptrons applied to the same inputs\n    s1 = NAND(x1, x2)\n    s2 = OR(x1, x2)\n    # second layer: combine the two intermediate outputs\n    y = AND(s1, s2)\n    return y<\/code><\/pre>\n\n\n\n<ul><li>It has been proved that a 2-layer perceptron with sigmoid as the activation function can approximate any continuous function to arbitrary accuracy, given enough hidden units.<\/li><\/ul>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-3.png\" alt=\"\" class=\"wp-image-4892\" width=\"251\" height=\"130\" srcset=\"http:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-3.png 796w, 
http:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-3-300x155.png 300w, http:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-3-768x398.png 768w\" sizes=\"(max-width: 251px) 100vw, 251px\" \/><\/figure><\/div>\n\n\n<p>By stacking multiple layers of perceptrons, we can approximate non-linear transformations. The key requirement is that an activation function is inserted between the layers. Without one, it is easy to show that: <\/p>\n\n\n\n<p>$$ W_2 \\big(W_1 X + b_1 \\big) + b_2 \\equiv W^* X + b^*$$<\/p>\n\n\n\n<p>, where <span class=\"katex math inline\">W^* = W_2 W_1<\/span> and <span class=\"katex math inline\">b^* = W_2 b_1 + b_2<\/span>.<\/p>\n\n\n\n<p>A stack of linear transformations is still a linear transformation. So, multiple layers of perceptrons gain nothing without an activation function.<\/p>\n\n\n\n<p>However, if an activation function, in other words a non-linear transformation, is added between layers, then the neural network can approximate non-linear functions.<\/p>\n\n\n\n<ul><li>Therefore, what Deep Learning does is <strong>apply multi-layer perceptrons<\/strong> to mimic the <strong>pattern<\/strong> of whatever is being modelled. Some key points are <ul><li>(1) <strong>Multi-layer perceptrons<\/strong> are called a <strong>Neural Network<\/strong>. Each layer applies a linear transformation <span class=\"katex math inline\">WX + b<\/span> and then an <strong>activation function,<\/strong> such as <span class=\"katex math inline\">Sigmoid(x) = \\frac{1}{1+e^{-x}}<\/span>; <\/li><li>(2) The weight matrix <span class=\"katex math inline\">W<\/span> and the bias <span class=\"katex math inline\">b<\/span> are updated until the neural network outputs an &#8220;optimal&#8221; result.<\/li><\/ul><\/li><\/ul>\n\n\n\n<ul><li><strong><em>(1) How to define the layers and network structure, (2) how to evaluate whether the result is &#8220;optimal&#8221;, and (3) how to find the weights<\/em><\/strong> are the main problems in Deep Learning.<\/li><\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">2. 
Network Structure &#8211; Forward Propagation<\/h2>\n\n\n\n<p>Let&#8217;s start with the first question: how to define the layers and network structure. Through the network, <\/p>\n\n\n\n<ol><li>We first input the data, <span class=\"katex math inline\">X<\/span>. <\/li><li>The data are transformed repeatedly by ( Perceptron &#8211; &#8220;Affine&#8221;, Activation Function ) pairs, which form the <strong>multiple layers<\/strong>. <\/li><li>Then, the results are passed through a <strong>Softmax<\/strong> function, which converts them into probabilities that sum to one. <\/li><li>Finally, a <strong>Loss<\/strong> is calculated.<\/li><\/ol>\n\n\n\n<p>$$ x \\to f(.) \\to a(.) \\to \\text{\u2026} \\to Softmax(.) \\to Loss(.)$$<\/p>\n\n\n\n<p>, where <span class=\"katex math inline\">f(x) = xw + b<\/span> is the affine transformation.<\/p>\n\n\n\n<p>, where <span class=\"katex math inline\">a(x)<\/span> is the activation function.<\/p>\n\n\n\n<p>, where <span class=\"katex math inline\">Loss(x)<\/span> is the loss function.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">2.1 Affine &#8211; Linear Transformation<\/h4>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-4.png\" alt=\"\" class=\"wp-image-4931\" width=\"440\" height=\"79\" srcset=\"http:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-4.png 896w, http:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-4-300x54.png 300w, http:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-4-768x139.png 768w\" sizes=\"(max-width: 440px) 100vw, 440px\" \/><\/figure><\/div>\n\n\n<p>$$ f(X) = X \\cdot W +b $$<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">2.2 Activation Function<\/h4>\n\n\n\n<ul><li>ReLU,  <span class=\"katex math inline\">f(x) = max(x, 0)<\/span><\/li><\/ul>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" 
src=\"https:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-5.png\" alt=\"\" class=\"wp-image-4941\" width=\"218\" height=\"166\" srcset=\"http:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-5.png 543w, http:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-5-300x228.png 300w\" sizes=\"(max-width: 218px) 100vw, 218px\" \/><\/figure><\/div>\n\n\n<ul><li>Sigmoid, <span class=\"katex math inline\">f(x) = \\frac{1}{1+e^{-x}}<\/span><\/li><\/ul>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-6.png\" alt=\"\" class=\"wp-image-4942\" width=\"222\" height=\"168\" srcset=\"http:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-6.png 547w, http:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-6-300x227.png 300w\" sizes=\"(max-width: 222px) 100vw, 222px\" \/><\/figure><\/div>\n\n\n<ul><li>tanh, <span class=\"katex math inline\">f(x) = \\frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}<\/span><\/li><\/ul>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-7.png\" alt=\"\" class=\"wp-image-4943\" width=\"251\" height=\"182\" srcset=\"http:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-7.png 568w, http:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-7-300x218.png 300w\" sizes=\"(max-width: 251px) 100vw, 251px\" \/><\/figure><\/div>\n\n\n<p>Activation functions have various properties, such as symmetry about the origin and differentiability. Those details are not discussed here.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">2.3 Softmax<\/h4>\n\n\n\n<p>$$ \\vec{X} = (x_1, \\dots, x_i, \\dots) $$<\/p>\n\n\n\n<p>$$ Softmax(\\vec{X}) = \\vec{Y} = (\\dots, \\frac{e^{x_i}}{\\sum_j e^{x_j}} ,\\dots)^T $$<\/p>\n\n\n\n<p>The inputs are converted into probabilities, and those probabilities sum to one. 
<\/p>\n\n\n\n<p>P.S.<\/p>\n\n\n\n<p>$$ y_k = \\frac{e^{a_k}}{\\sum_{i=1}^{n}e^{a_i} } = \\frac{C e^{a_k}}{C \\sum_{i=1}^{n}e^{a_i} } $$<\/p>\n\n\n\n<p>$$ = \\frac{e^{a_k+ \\ln C} }{\\sum_{i=1}^{n}e^{a_i + \\ln C} } $$<\/p>\n\n\n\n<p>$$ = \\frac{e^{a_k - C'} }{\\sum_{i=1}^{n}e^{a_i - C'} } $$<\/p>\n\n\n\n<p>, where <span class=\"katex math inline\">C' = -\\ln C<\/span>.<\/p>\n\n\n\n<p>To prevent overflow, subtract the maximum value from the inputs beforehand, that is, take <span class=\"katex math inline\">C' = \\max_i a_i<\/span>. The largest input is the one that matters; whether the remaining terms underflow cannot be helped and is of no concern.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">2.4 Loss Function<\/h4>\n\n\n\n<ul><li>Cross Entropy, <span class=\"katex math inline\">L = -\\sum t_i \\ln(y_i) = - \\ln(\\vec{Y}) \\cdot \\mathbf{t}^T<\/span><\/li><li>MSE, see the &#8220;Linear Regression Paddle&#8221; note. <span class=\"katex math inline\">E = \\frac{1}{2} \\sum_k (y_k - t_k)^2<\/span><\/li><\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">3. Minimise Loss &amp; Update Weights<\/h2>\n\n\n\n<p>We consider the &#8220;optimal&#8221; weights to be those that give the minimum loss. In other words, we aim to find the <span class=\"katex math inline\">w<\/span> and <span class=\"katex math inline\">b<\/span> that minimise the loss function.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">How to Do It?<\/h4>\n\n\n\n<p>$$ \\arg\\min_{w, b} Loss $$<\/p>\n\n\n\n<p>$$ \\hat{w}: \\frac{\\partial L}{\\partial w} = 0 ,\\quad \\hat{b}: \\frac{\\partial L}{\\partial b} = 0$$<\/p>\n\n\n\n<p>Recall the pathway:<\/p>\n\n\n\n<p>$$ x \\to f(.) \\to a(.) \\to \\text{\u2026} \\to Softmax(.) 
\\to Loss(.)$$<\/p>\n\n\n\n<p>To solve the first-order conditions (F.O.C.), we apply the chain rule backward through the network, which is called <strong>Backward Propagation.<\/strong><\/p>\n\n\n\n<p><strong>By the Chain Rule,<\/strong><\/p>\n\n\n\n<p>$$ \\frac{\\partial Loss(w)}{\\partial w} = \\frac{\\partial Loss(.)}{\\partial Softmax} \\cdot \\frac{\\partial Softmax(.)}{\\partial a} \\cdot \\frac{\\partial a}{\\partial f} \\cdot \\frac{\\partial f}{\\partial a} \\cdots \\frac{\\partial f}{\\partial w} $$<\/p>\n\n\n\n<p>Let&#8217;s decompose each part in the following.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">3.1 Loss<\/h4>\n\n\n\n<p>$$ \\frac{\\partial Loss(\\vec{Y}, \\vec{t})}{\\partial \\vec{Y}} $$<\/p>\n\n\n\n<ul><li>Cross Entropy<\/li><\/ul>\n\n\n\n<p>$$ L = -\\sum t_i \\ln(y_i) = - \\ln(\\vec{Y}) \\cdot \\mathbf{t}^T$$<\/p>\n\n\n\n<p>$$\\frac{\\partial L}{\\partial y_i} = - \\frac{t_i}{y_i}$$<\/p>\n\n\n\n<p>$$\\frac{\\partial L}{\\partial \\vec{Y}} = - \\big(\\dots,\\frac{t_i}{y_i},\\dots\\big)^T$$<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">3.2 Softmax<\/h4>\n\n\n\n<p>$$ \\vec{X} = (x_1, \\dots, x_i, \\dots) 
$$<\/p>\n\n\n\n<p>$$ Softmax(\\vec{X}) = \\vec{Y} = (\\dots, \\frac{e^{x_i}}{\\sum_j e^{x_j}} ,\\dots)^T $$<\/p>\n\n\n\n<p>$$ \\frac{\\partial Softmax(\\vec{X})}{\\partial \\vec{X}} = Diag(\\vec{Y}) - Y\\cdot Y^T $$<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p><strong>The derivation is as follows,<\/strong><\/p>\n\n\n\n<p><a href=\"https:\/\/blog.csdn.net\/Wild_Young\/article\/details\/121912675\">https:\/\/blog.csdn.net\/Wild_Young\/article\/details\/121912675<\/a><\/p>\n\n\n\n<p><span class=\"katex math inline\">\\frac{\\partial Y}{\\partial X}<\/span>:<\/p>\n\n\n\n<p>$$Softmax(X) = \\Big(\\dots, \\frac{e^{x_k}}{\\sum_{i=1}^{N} e^{x_i}}, \\dots\\Big)^T$$<\/p>\n\n\n\n<p>Let <span class=\"katex math inline\">X<\/span> be a vector with shape = (n,1), and <span class=\"katex math inline\">Y = Softmax(X)<\/span>.<\/p>\n\n\n\n<p>That means, for each element <span class=\"katex math inline\">y_k<\/span> of Y,<\/p>\n\n\n\n<p>$$ y_k = \\frac{e^{x_k}}{\\sum_{i=1}^{N} e^{x_i}} $$<\/p>\n\n\n\n<p>So, we need every entry of<\/p>\n\n\n\n<p>$$ \\frac{\\partial Y}{\\partial X} $$<\/p>\n\n\n\n<p>If <span class=\"katex math inline\">j = i<\/span>,<\/p>\n\n\n\n<p>$$ \\frac{\\partial y_j}{\\partial x_i} = \\frac{\\partial }{\\partial x_i} \\Bigg( \\frac{e^{x_i}}{\\sum_m e^{x_m}} \\Bigg) $$<\/p>\n\n\n\n<p>$$ = \\frac{ (e^{x_i})'\\sum_m e^{x_m} - e^{x_i} (\\sum_m e^{x_m})' }{\\big( \\sum_m e^{x_m} \\big)^2} $$<\/p>\n\n\n\n<p>$$ =\\frac{e^{x_i}}{\\sum_m e^{x_m}} - \\frac{e^{x_i}}{\\sum_m e^{x_m}} \\cdot \\frac{e^{x_i}}{\\sum_m e^{x_m}}$$<\/p>\n\n\n\n<p>$$= y_i - y_i^2 =y_i(1-y_i)$$<\/p>\n\n\n\n<p>If <span class=\"katex math inline\">j \\neq i<\/span>,<\/p>\n\n\n\n<p>$$ \\frac{\\partial y_j}{\\partial x_i} = \\frac{\\partial }{\\partial x_i} \\Bigg( \\frac{e^{x_j}}{\\sum_m e^{x_m}} \\Bigg) $$<\/p>\n\n\n\n<p>$$ = \\frac{-e^{x_i} e^{x_j}}{(\\sum_m e^{x_m})^2} $$<\/p>\n\n\n\n<p>$$ =- \\frac{e^{x_j}}{\\sum_m e^{x_m}} \\cdot \\frac{e^{x_i}}{\\sum_m e^{x_m}} $$<\/p>\n\n\n\n<p>$$ = -y_j y_i $$<\/p>\n\n\n\n<p>$$ \\frac{\\partial y_j}{\\partial x_i}=y_i - y_i^2 \\quad \\text{,if $i = j$} 
$$<\/p>\n\n\n\n<p>$$\\frac{\\partial y_j}{\\partial x_i} = -y_j y_i \\quad \\text{,if $i \\neq j$} $$<\/p>\n\n\n\n<p>Therefore, for <span class=\"katex math inline\">\\frac{\\partial Y}{\\partial X}<\/span>, we get the Jacobian matrix,<\/p>\n\n\n\n<p>$$\\frac{\\partial \\vec{Y}}{\\partial \\vec{X}} = Diag(Y) - Y Y^T$$<\/p>\n\n\n\n<p>The diagonal of this matrix is the case <span class=\"katex math inline\">i = j<\/span>,<\/p>\n\n\n\n<p>$$ \\frac{\\partial y_i}{\\partial x_i} = y_i-y_i^2 =y_i(1-y_i)$$<\/p>\n\n\n\n<p>The off-diagonal entries, the case <span class=\"katex math inline\">i \\neq j<\/span>, are also needed when the softmax is combined with the cross-entropy loss in Section 3.3.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">3.3 Loss (Cross-Entropy) and Softmax<\/h4>\n\n\n\n<p>There is a trick in combining the cross-entropy error with softmax: the gradient of the combination simplifies considerably.<\/p>\n\n\n\n<p>The Cross-Entropy Loss function is, <\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-8.png\" alt=\"\" class=\"wp-image-4985\" width=\"367\" height=\"88\" srcset=\"http:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-8.png 884w, http:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-8-300x72.png 300w, http:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/image-8-768x184.png 768w\" sizes=\"(max-width: 367px) 100vw, 367px\" \/><\/figure><\/div>\n\n\n<p>, where <span class=\"katex math inline\">\\vec{Y_{log}}^T<\/span> is obtained by applying log to each element of <span class=\"katex math inline\">\\vec{Y}<\/span> and then taking the transpose.<\/p>\n\n\n\n<p><a href=\"https:\/\/www.matrixcalculus.org\">https:\/\/www.matrixcalculus.org<\/a><\/p>\n\n\n\n<p>$$\\frac{\\partial L}{\\partial \\vec{Y}} = - \\vec{\\mathbf{t}}^T \\cdot Diag(Y_{1\/Y}) $$<\/p>\n\n\n\n<p>, where <span class=\"katex math inline\">Y_{1\/Y}<\/span> is <span class=\"katex math inline\">1\/y_i<\/span> for each element of <span class=\"katex math inline\">Y<\/span>.<\/p>\n\n\n\n<p>Since every <span class=\"katex math inline\">y_i<\/span> depends on <span class=\"katex math inline\">x_k<\/span>, the chain rule sums over all <span class=\"katex math inline\">i<\/span>,<\/p>\n\n\n\n<p>$$ \\frac{\\partial L}{\\partial x_k } = \\sum_i \\frac{\\partial L}{\\partial y_i}\\cdot \\frac{\\partial y_i}{\\partial x_k} $$<\/p>\n\n\n\n<p>$$\\frac{\\partial L}{\\partial x_k } =- \\sum_i \\frac{\\partial }{\\partial {y}_i} \\bigg( t_i \\ln(y_i) \\bigg) \\cdot \\frac{\\partial y_i}{\\partial x_k} $$<\/p>\n\n\n\n<p>$$\\frac{\\partial L}{\\partial x_k } =- \\sum_i \\frac{t_i }{{ y}_i} \\cdot \\frac{\\partial {y}_i}{\\partial x_k} $$<\/p>\n\n\n\n<p>Plug in the derivatives of <strong>Softmax<\/strong>, <span class=\"katex math inline\">\\frac{\\partial y}{\\partial x}<\/span>, splitting the sum into the term <span class=\"katex math inline\">i = k<\/span> and the terms <span class=\"katex math inline\">i \\neq k<\/span>,<\/p>\n\n\n\n<p>$$\\frac{\\partial L}{\\partial x_k } = - \\frac{t_k }{{ y}_k} \\cdot \\frac{\\partial {y}_k}{\\partial x_k} - \\sum_{i\\neq k} \\frac{t_i }{{ y}_i} \\cdot \\frac{\\partial {y}_i}{\\partial x_k} $$<\/p>\n\n\n\n<p>$$ = -\\frac{t_k }{{ y}_k} \\cdot {y}_k (1-{y}_k) - \\sum_{i\\neq k} \\frac{t_i }{{ y}_i} \\cdot (- {y}_i {y}_k) $$<\/p>\n\n\n\n<p>$$ = - t_k (1-{y}_k) + \\sum_{i\\neq k} t_i {y}_k $$<\/p>\n\n\n\n<p>$$ = - t_k + \\sum_{i} t_i \\ {y}_k $$<\/p>\n\n\n\n<p>$$ = - t_k + {y}_k \\sum_{i} t_i \\quad\\text{where the labels $t$ sum to 1} $$<\/p>\n\n\n\n<p>$$ \\frac{\\partial L}{\\partial x_k } ={y}_k - t_k$$<\/p>\n\n\n\n<p>So, we get <span class=\"katex math inline\">\\frac{\\partial L}{\\partial x_k} ={y}_k - t_k<\/span>.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">3.4 Affine<\/h4>\n\n\n\n<p>$$ \\frac{\\partial f}{\\partial w} $$<\/p>\n\n\n\n<p>$$ \\frac{\\partial f}{\\partial b} $$<\/p>\n\n\n\n<ul><li>Backward: with <span class=\"katex math inline\">f(X) = X \\cdot W + b<\/span>, the upstream gradient <span class=\"katex math inline\">\\frac{\\partial L}{\\partial f}<\/span> is passed back as follows.<\/li><\/ul>\n\n\n\n<p>$$ \\frac{\\partial L}{\\partial X} = \\frac{\\partial L}{\\partial f} \\cdot W^T \\quad \\text{, Right Multiplied}$$<\/p>\n\n\n\n<p>$$ \\frac{\\partial L}{\\partial W} = X^T \\cdot \\frac{\\partial L}{\\partial f} \\quad \\text{, Left Multiplied}$$<\/p>\n\n\n\n<p>$$ \\frac{\\partial L}{\\partial b} = \\mathbf{1}^T \\cdot \\frac{\\partial L}{\\partial f} \\quad \\text{, Left Multiplied} $$<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">4. 
Other Theoretical Details<\/h2>\n\n\n\n<ol><li><strong>Batch<\/strong>: mini-batch and epoch for improving training accuracy.<\/li><li><strong>Weights Initialisation<\/strong>: Xavier &#8211; Sigmoid and tanh, He &#8211; ReLU<\/li><li><strong>Weights Updating<\/strong>: SGD, Momentum, Adam, etc.<\/li><li><strong>Overfitting<\/strong>: Batches, Regularisation, Dropout.<\/li><li><strong>Hyperparameters<\/strong>: Bayes<\/li><\/ol>\n\n\n\n<p>See Chapter 6 of the book.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">5. Code<\/h2>\n\n\n\n<p>See the attachment for an implementation in pure NumPy.<\/p>\n\n\n\n<div class=\"wp-block-file\"><a id=\"wp-block-file--media-b69af0c0-b012-4e4b-880e-a8d632388b18\" href=\"https:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/Collection_of_Classes.html\">Collection_of_Classes<\/a><a href=\"https:\/\/fanyuzhao.com\/wp-content\/uploads\/2023\/01\/Collection_of_Classes.html\" class=\"wp-block-file__button\" download aria-describedby=\"wp-block-file--media-b69af0c0-b012-4e4b-880e-a8d632388b18\">Download<\/a><\/div>\n\n\n\n<p><\/p>\n\n\n\n<p>Code and Books: http:\/\/82.157.170.111:1011\/s\/9D9BCBbCop6ERaD<\/p>\n","protected":false},"excerpt":{"rendered":"<p>1. Perceptron 1.1 AND &amp; OR &amp; &#8216;N&#8217; Gate Linear Classification might be applied by AND &amp; OR. 1.2 XOR Gate &#8211; Apply More than Two Perceptrons to Achieve Non-Linear Classification It has been proved that any functions could be represented by a combination of 2 perceptrons with sigmoid as the activation function. 
Apply Multi &hellip; <a href=\"https:\/\/fanyuzhao.com\/?p=4880\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Deep Learning from Scratch<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[6,8],"tags":[],"_links":{"self":[{"href":"https:\/\/fanyuzhao.com\/index.php?rest_route=\/wp\/v2\/posts\/4880"}],"collection":[{"href":"https:\/\/fanyuzhao.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fanyuzhao.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/fanyuzhao.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/fanyuzhao.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=4880"}],"version-history":[{"count":115,"href":"https:\/\/fanyuzhao.com\/index.php?rest_route=\/wp\/v2\/posts\/4880\/revisions"}],"predecessor-version":[{"id":5048,"href":"https:\/\/fanyuzhao.com\/index.php?rest_route=\/wp\/v2\/posts\/4880\/revisions\/5048"}],"wp:attachment":[{"href":"https:\/\/fanyuzhao.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=4880"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fanyuzhao.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=4880"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fanyuzhao.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=4880"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}