Sigmoid & Logistic

The sigmoid function is widely used for binary classification, both in machine learning and in econometrics.

Why does the sigmoid function take this form?

Firstly, let’s introduce the odds.

Odds provide a measure of the likelihood of a particular outcome: the ratio of the number of events that produce the outcome to the number that do not.

Odds also have a simple relation with probability: the odds of an outcome are the ratio of the probability that the outcome occurs to the probability that it does not occur. In mathematical terms, p is the probability of the outcome and 1-p is the probability that it does not occur.

$$ odds = \frac{p}{1-p} $$

Odds and Probability

Let’s look for some intuition behind probability and odds. Probability attaches a number to each outcome: Pr(Y), where Y is the outcome and Pr(\cdot) is the probability function that maps each outcome to its probability.

What about the odds? Odds are a ratio computed from the probability, as the formula above shows.

Implication: compared with the probability, odds tell us how balanced the binary outcome is, rather than describing the probability distribution itself.

Example

Rolling a six-sided die: the probability of rolling a 6 is 1/6, but the odds are 1/5.

Formula

$$ odds = \frac{Pr(Y)}{1-Pr(Y)} $$

where Y is the outcome.

Logit

As the probability Pr(Y) always lies in [0,1], the odds must be non-negative, odds \in [0,\infty). We may want to apply a monotonic transformation to re-gauge that range, so we take the logarithm.

$$ Logit := \log(odds) = \log\bigg( \frac{Pr(Y)}{1-Pr(Y)} \bigg) $$

This log-odds transformation is the logit; its inverse will turn out to be the sigmoid (logistic) function.

As the transformation we apply is monotonic, the logit preserves the properties of the odds: it carries the same implication, representing how balanced the two binary outcomes are.

Next, we link the outcome to the covariates X. Assuming the log-odds is a linear function of X, we write g(X) = X\beta, so that

$$ g(X) = X\beta = \log\bigg( \frac{p}{1-p} \bigg), \qquad p := Pr(Y=1 \mid X) $$

$$ e^g = \frac{p}{1-p} $$

$$ p = \frac{e^g}{e^g+1}=\frac{1}{1+e^{-g}}$$

$$ p = \frac{1}{1+e^{-X\beta}}$$

We finally obtain the logistic sigmoid function above; it is the inverse of the logit.
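As a small illustration (my own sketch, with made-up numbers), the mapping from the linear score X\beta to a probability and back to the log-odds can be computed directly:

import numpy as np

def sigmoid(z):
    # p = 1 / (1 + exp(-z)), the inverse of the logit
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[1.0, 0.5],
              [1.0, -2.0]])            # first column plays the role of an intercept
beta = np.array([0.2, 1.5])            # illustrative coefficients
p = sigmoid(X @ beta)                  # modelled P(Y = 1 | X)
odds = p / (1 - p)
print(p)
print(np.log(odds) - X @ beta)         # the log-odds recovers X @ beta (zeros up to rounding)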

The High-Dimensional Era of Factor Investing

Abstract: Empirical asset pricing has entered the high-dimensional era of factors (covariates). This article lays out my four reflections on it as a starting point for discussion. (DGP: Data Generating Process)

  • Key takeaway: the era of over-parameterization has arrived. Letting the number of variables k exceed the number of observations t does not seem to cause problems from interpolation; on the contrary, it helps prediction.

0 Prologue

Today, empirical asset pricing (and factor investing) has entered the high-dimensional era of factors (covariates). A large body of empirical results published in top journals shows that multi-factor models carry substantial uncertainty and that the sparsity assumption on factors does not hold. The familiar ad-hoc parsimonious models cannot guide future investment.

In the high-dimensional era, finding covariates that truly predict expected returns is one of the core problems. To achieve this goal, the issues to consider include: (1) multiple hypothesis testing; (2) investors' (high-dimensional) learning problem & alternative data; (3) guidance from asset pricing theory, namely that factors explaining expected returns should also explain the common movement of assets. Finally, the latest hot topic is whether more factors are always better (i.e., whether higher model complexity is always better): a complex model approximates the true DGP better, but its parameter estimates have larger variance; a parsimonious model estimates parameters more precisely, but it may not be a reasonable approximation of the DGP. How should one weigh the two?

Recently I gave a talk titled "The High-Dimensional Era of Factor Investing" at a brokerage's 2023 annual strategy conference, presenting my thoughts on the four points above. This article gives a brief introduction based on the talk's slides. Since this column has already covered some of the topics at length (multiple hypothesis testing, for example), the exposition below uses the fewest possible words where appropriate (you will see what I mean shortly).

1 Multiple Hypothesis Testing

For this part, a picture is worth a thousand words. Readers who need the background can check the《出色不如走运》series on the official account.

Next.

2 Investors' Learning Problem & Alternative Data

Rational expectations assume that investors know the true valuation model.

However, just like you and me doing ex post factor analysis, investors face the high dimensionality of covariates when they invest, so they cannot know the true valuation model and the rational expectations assumption fails. As a result, equilibrium asset prices deviate from those under rational expectations. In ex post analysis, realized returns contain a predictable component induced by estimation error. For investors, however, this predictability cannot be exploited ex ante.

Predictability induced by estimation error generates spurious in-sample (IS) predictability (regardless of whether investors use priors, and regardless of whether those priors are correct), yet it cannot predict returns out of sample (OOS). This is the spurious predictability caused by investors' (high-dimensional) learning problem (Martin and Nagel 2022); see "False In-Sample Predictability ?" for details. To guard against it, OOS tests are required.

Investors cannot handle high dimensionality in ex ante investing, which mainly shows up in their use of relatively few covariates (factors) as the basis of valuation. In addition, because some variables are costly to obtain, investors must weigh a variable's predictive benefit against its cost. Moreover, the limited-attention mechanism of bounded rationality provides a micro-foundation for investors' preference for parsimony. Together these forces lead investors to use overly sparse valuation models when pricing assets. The consequence is that spurious predictability caused by investor learning appears even out of sample.

Following this line of reasoning, we naturally arrive at another related topic of this subsection: alternative data.

Recall Bekkerman, Fich and Khimich (forthcoming), introduced in the earlier article《科技关联度II》on the official account. Compared with earlier studies based on patent classes, that paper applies text analysis to patents, extracting technical terms and computing their overlap to measure the similarity between firms, and thereby constructs a technology-linkage effect with higher expected excess returns.

Compared with patent classes, obtaining and processing patent text and computing technology linkage is much more costly for investors. Most investors will therefore ignore this information when valuing firms, i.e., they use overly sparse valuation models, which creates both in-sample and out-of-sample return predictability.

The text-based technology linkage in that paper is a typical example of the recent boom in using alternative data for empirical asset pricing and factor investing. However, the world without rational expectations sketched by Martin and Nagel (2022) tells us that when predictors built with the latest methods and techniques are applied to earlier historical periods, they can predict expected returns in both in-sample and (pseudo) out-of-sample tests. So, alongside our excitement about alternative-data discoveries, we should probably keep an extra dose of caution.

Beyond that, since we are on the topic of alternative data, let me mention one more related issue. Research shows that the data provided by many overseas alternative-data vendors only predict firm fundamentals over short horizons. Dessaint, Foucault and Fresard (2020) show that this improved availability of alternative data lowers the cost of short-horizon forecasts (raising their accuracy) but raises the cost of long-horizon forecasts (lowering their accuracy). For forecasting firm fundamentals, the combined effect is mixed. It is foreseeable that more research on using alternative data to forecast fundamentals will move in this direction.

3 It Is About the Covariance Matrix

Ross's (1976) APT points out that factors explaining the cross-section of expected asset returns should also be able to explain the common movement of assets. Under the assumption that no near risk-free arbitrage opportunities exist in the market, Kozak, Nagel and Santosh (2018) make the same argument (see "Which beta (III)?").

The following shows the performance of the seven common style factors for A-shares, constructed and maintained by the BetaPlus group, between 2000/01/01 and 2022/04/30. The statistics show that over this sample period, although the factors' mean returns differ, the standard deviations of their returns differ as well, so no style factor's risk-adjusted return is clearly higher than the others'.

Now consider a US example. Kozak, Nagel and Santosh (2018) split the sample period into two halves and examined 15 factors. The figure below is a scatter plot of each factor's Sharpe ratio in the two sub-periods. If one could earn high returns while lowering volatility, factors with high in-sample (first-half) Sharpe ratios should also have higher out-of-sample (second-half) Sharpe ratios, and the points would cluster around the 45-degree line. That is not the case. No matter how high the in-sample Sharpe ratio, the out-of-sample Sharpe ratios of these 15 factors form almost a horizontal line rather than the expected 45-degree line. (I ran the same exercise on several hundred factors in the A-share market and observed similar results.)

These results show that (out of sample) high returns tend to come with high volatility (in-sample data snooping is another story), an empirical finding consistent with the APT.

Decades ago, Eugene Fama quipped that the APT "legitimizes" the many attempts at factor mining: the APT only says that expected asset returns are related to many factors, without saying which ones. Many scholars have therefore mined crop after crop of factors (the zoo of factors) under the banner of the APT, a phenomenon Fama described as the APT handing such research a "fishing license" (i.e., the APT legitimizes it). (Sorry, I cannot resist a complaint here: in the Chinese translation of a famous asset pricing textbook, the translators actually rendered "fishing license" literally as "钓鱼许可证", a permit for catching fish...)

Today, when we revisit the APT, we should without question treat it as effective guidance for mining true factors, just as stated at the beginning of this section: factors explaining the cross-section of expected returns should also explain the common movement of assets. With this understanding, a line of empirical asset pricing research represented by PCA has achieved many breakthroughs in recent years (Kelly, Pruitt and Su 2019; Kozak, Nagel and Santosh 2020).

4 The More Complex, the Better?

In this section the number of factors stands for model complexity: the more factors, the more complex the model.

In 2019, Belkin et al. (2019) documented the "double descent" phenomenon of out-of-sample error in machine learning, triggering extensive discussion in both the machine learning and theoretical statistics communities. To understand it, let us start from the familiar bias-variance trade-off.

A model's out-of-sample performance is closely tied to its complexity. When complexity is low, the model's variance is small (because the variance of the parameter estimates is small) but its bias is high; when complexity is high, the variance grows but the bias falls. Their combined effect is the familiar U-shape, the bias-variance trade-off, so there exists an optimal hyperparameter that minimizes the total out-of-sample error (risk).

We can also view the bias-variance trade-off from another angle, one that is crucial for understanding double descent. When a model is very simple, it effectively avoids overfitting, but it is hard to believe that such a simple model is a good approximation of the real world; when a model is complex, it is more likely to approximate the real world, but it is indeed more prone to overfitting. The bias-variance trade-off can therefore also be read as an approximation-overfit trade-off.

However, this conclusion rests on a premise we all take for granted: the number of variables < the number of observations. What happens if the model is so complex that the number of variables (factors) exceeds the number of observations? The question is not idle speculation. For complex neural networks, the number of parameters easily exceeds the number of observations, and yet such models perform remarkably well out of sample (well, not in asset pricing, of course). This phenomenon pushed people to figure out what is behind the scene.

When the number of variables > the number of observations, the model can fit every in-sample observation perfectly (in machine learning terminology, interpolation). The usual belief is that such a model must "blow up" out of sample, i.e., be useless, because it has fitted all the noise in the in-sample data. However, Belkin et al. (2019) show that once model complexity crosses the "forbidden zone" at the number of observations, something remarkable happens: the total out-of-sample error does not blow up, but instead decreases monotonically as complexity rises. Because the error declines monotonically on both sides of the interpolation threshold, Belkin et al. (2019) call the phenomenon double descent.

Since the purpose of this article is not to explain the underlying statistical theory, I will only offer some intuition. When the number of variables exceeds the number of observations, the in-sample solution is not unique, and the optimal solution can be understood as the one with the smallest parameter variance (regularization, or implied regularization, plays a very important role here). As the number of variables keeps growing, the variance of the optimal solution keeps falling monotonically.

Now consider the bias. Ordinarily, bias does increase with complexity. But every model is a mis-specified version of the true DGP. Under model misspecification, it can be shown that once the number of variables exceeds the number of observations, the bias also falls with complexity within a certain range. The combined result is that the model's out-of-sample error falls as complexity rises. (In some cases, the global minimum of out-of-sample error occurs when the number of variables > the number of observations.)

The following two slides summarize the discussion above (the table in the second slide draws on a talk by Bryan Kelly, which I note here).

For asset pricing and factor investing, if you, like me, accept the high-dimensional era of factors, that is, that the DGP of returns involves a very large number of factors, then this discussion of model complexity may offer fresh and useful insights. Leading researchers are already ahead on this front. Bryan Kelly, together with his co-authors and students, has written a series of "virtue of complexity" working papers exploring the out-of-sample benefits of raising complexity in asset pricing. For example, Kelly, Malamud and Zhou (2022) use neural networks to time the US market (each model is trained on only 12 monthly observations, one year of data) and find a similar double descent phenomenon.

Of course, even if we accept "the more complex, the better", the more important questions remain: how to estimate the parameters, how to regularize, and how to use thousands or even more factors to form better forecasts of expected returns. Although the work of Kelly et al. delivers exciting results for market timing, whether similar empirical results hold in the cross-section still needs time to answer (Kelly has a working paper on the cross-section, but it is not yet publicly available).

In any case, welcome to the era of over-parameterization.

5 Closing Remarks

These are my four reflections on the high-dimensional era of factor investing.

At the end of this article, however, it is still worth pointing out that in the high-dimensional era of covariates, how to prepare factors is certainly important (beware of multiple hypothesis testing, beware of investor learning, use the implications of the APT), but how to solve the high-dimensional problem is even more central (how do we harvest the benefits of complexity?).

Perhaps we have reached the moment of an inevitable transition from econometrics to machine learning. As Stefan Nagel's Machine Learning in Asset Pricing (Nagel 2021) advocates, injecting economic reasoning into machine learning algorithms will be a necessary path for research in the high-dimensional era.

References

  • Baba-Yara, F., B. Boyer, and C. Davis (2021). The factor model failure puzzle. Working paper.
  • Bekkerman, R., E. M. Fich, and N. V. Khimich (forthcoming). The effect of innovation similarity on asset prices: Evidence from patents’ big data. Review of Asset Pricing Studies.
  • Belkin, M., D. Hsu, S. Ma, and S. Mandal (2019). Reconciling modern machine-learning practice and the classical bias-variance trade-off. PNAS 116(32), 15849 – 15854.
  • Dessaint, O., T. Foucault, and L. Fresard (2020). Does alternative data improve financial forecasting? The horizon effect. Working paper.
  • Kelly, B. T., S. Malamud, and K. Zhou (2022). The virtue of complexity in return prediction. Working paper.
  • Kelly, B. T., S. Pruitt, and Y. Su (2019). Characteristics are covariances: A unified model of risk and return. Journal of Financial Economics 134(3), 501 – 524.
  • Kozak, S., S. Nagel, and S. Santosh (2018). Interpreting factor models. Journal of Finance 73(3), 1183 – 1223.
  • Kozak, S., S. Nagel, and S. Santosh (2020). Shrinking the cross-section. Journal of Financial Economics 135(2), 271 – 292.
  • Linnainmaa, J. T. and M. R. Roberts (2018). The history of the cross-section of stock returns. Review of Financial Studies 31(7), 2606 – 2649.
  • Martin, I. and S. Nagel (2022). Market efficiency in the age of big data. Journal of Financial Economics 145(1), 154 – 177.
  • Nagel, S. (2021). Machine Learning in Asset Pricing. Princeton University Press.
  • Ross, S. A. (1976). The arbitrage theory of capital asset pricing. Journal of Economic Theory 13(3), 341 – 360.

By

石川 https://zhuanlan.zhihu.com/p/589370949?utm_medium=social&utm_oi=774013724896788480&utm_psn=1583185642497994752&utm_source=wechat_session&utm_id=0

Gradient / Derivative in Python

By definition, the gradient is the vector of partial derivatives:

$$ \nabla f(x_1, x_2) = \bigg( \frac{\partial f}{\partial x_1}, \ \frac{\partial f}{\partial x_2} \bigg) $$

Each partial derivative is defined as a limit (the forward difference),

$$ \frac{\partial f}{\partial x_1} = \lim_{h\rightarrow 0}\frac{f(x+h)-f(x)}{h} $$

and is approximated numerically by the central difference,

$$ \frac{\partial f}{\partial x_1} \approx \frac{f(x+h)-f(x-h)}{2h} $$

  • Code:
import numpy as np

func_1 = lambda x: x**2 + 5
func_2 = lambda x: x[0]**2 + x[1]**3 + 1
x_lim = np.arange(-5, 5, 0.01)           # x grid for plotting a tangent line
input_val = np.array([2.0, 3.0])         # point at which the gradient is evaluated


class Differentiate:
    def __init__(self):
        self.h = 1e-5      # step size for the central difference
        self.dx = None     # last computed derivative / gradient

    def d1_diff(self, f, x):
        # central-difference derivative of a scalar function f at x
        fh1 = f(x + self.h)
        fh2 = f(x - self.h)
        self.dx = (fh1 - fh2) / (2 * self.h)
        return self.dx

    def tangent(self, series, f, x_loc):
        """
        Return the tangent line of f at x_loc, evaluated on `series`, for plotting.
        """
        y_loc = f(x_loc)
        self.d1_diff(f, x_loc)          # slope at x_loc
        b = y_loc - self.dx * x_loc     # intercept
        y_series = self.dx * series + b
        return y_series

    # for f(x1, x2, x3, ...)
    def dn_diff(self, f, x):
        # numerical gradient: perturb one coordinate at a time
        grad = np.zeros_like(x)
        for i in range(len(x)):
            temp_val = x[i]
            x[i] = temp_val + self.h
            fxh1 = f(x)
            x[i] = temp_val - self.h
            fxh2 = f(x)
            grad[i] = (fxh1 - fxh2) / (2 * self.h)
            x[i] = temp_val             # restore the original value
        self.dx = grad
        return self.dx

    def gradient_descent(self, f, init_x, lr=0.01, step_num=1000):
        # repeatedly step against the numerical gradient
        x = init_x
        for _ in range(step_num):
            self.dn_diff(f, x)
            x -= lr * self.dx
        return x
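For instance (a usage sketch of my own; `bowl` is an illustrative convex function, not from the notes):

diff = Differentiate()
print(diff.d1_diff(func_1, 2.0))                    # ≈ 4.0, since d/dx (x**2 + 5) = 2x
print(diff.dn_diff(func_2, input_val.copy()))       # ≈ [4., 27.] at the point (2, 3)

bowl = lambda x: x[0]**2 + x[1]**2                  # minimised at (0, 0)
print(diff.gradient_descent(bowl, np.array([2.0, 3.0]), lr=0.05, step_num=500))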

Optimiser

In a neural network, we optimise (minimise) the loss by choosing the weight matrix W and the bias b.

$$ \arg\min_{W, b} Loss $$

Differentiating the loss gives the gradient, \nabla. We then update the weights by subtracting the learning rate times the gradient.

Here I wrote some test algorithms to see how optimisers evolve.

Preparation

Let’s make some preparations and notation clarifications.

Notation:

  • Y is the true (observed) value,
  • T is the estimate produced by the model.

$$Y = XW + b + e$$

Let the estimate be T = XW + b, computed with the current weights W and bias b.

$$ loss = r^T r, \quad r := Y - T $$

$$ loss = (Y-T)^T \cdot (Y-T) $$

$$ \frac{\partial Loss}{\partial W} = -2\,X^T (Y-T) = 2\,X^T (T-Y) $$

$$ \frac{\partial Loss}{\partial b} = -2\cdot \mathbf{1}^T (Y-T) = 2\cdot \mathbf{1}^T (T-Y) $$

Just One More Thing We Consider Here!

We do not want the number of samples N to affect the loss function's value, so we take the average instead of the sum.

$$ \frac{\partial Loss}{\partial W} = \frac{2}{N}\ X^T (T-Y) \quad, \quad \frac{\partial Loss }{\partial b} = \frac{2}{N} \ \mathbf{1}^T (T-Y) $$
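As a quick sanity check (a sketch of my own, with simulated data), these averaged gradients can be compared against central differences:

import numpy as np

rng = np.random.default_rng(0)
N, k = 50, 3
X = rng.normal(size=(N, k))
Y = X @ rng.normal(size=(k, 1)) + 0.5 + 0.1 * rng.normal(size=(N, 1))   # Y = XW + b + e

W, b = np.zeros((k, 1)), 0.0

def avg_loss(W, b):
    r = Y - (X @ W + b)                 # residual Y - T
    return np.sum(r**2) / N

# analytic averaged gradients: (2/N) X^T (T - Y) and (2/N) 1^T (T - Y)
T = X @ W + b
grad_W = 2.0 / N * X.T @ (T - Y)
grad_b = 2.0 / N * np.sum(T - Y)

# central-difference check on the first weight and on b
h = 1e-6
E = np.zeros_like(W); E[0, 0] = h
print(grad_W[0, 0], (avg_loss(W + E, b) - avg_loss(W - E, b)) / (2 * h))
print(grad_b,       (avg_loss(W, b + h) - avg_loss(W, b - h)) / (2 * h))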

Optimisers

Gradient Descent

$$ W_t = W_{t-1} - \eta \nabla_W $$

$$ b_t = b_{t-1} - \eta \nabla_b $$

where \eta is the learning rate.

However, plain gradient descent may not be satisfactory: gradients can be unavailable or vanish, and the whole process depends heavily on the hyperparameter \eta. Several other optimisation methods have therefore been developed.

Different optimisers aim to find the optimal weights faster and more accurately. The following sections show how they are designed.

Stochastic Gradient Descent – SGD

In SGD, the "stochastic" part means that each update uses a randomly drawn sample (or mini-batch) rather than the full dataset; this makes each step cheaper and introduces noise that helps the model avoid overfitting the training set.

Momentum SGD

A momentum term is added.

  • Apply Momentum to the Gradient, \nabla.

$$\text{Gradient Descent:} \quad w_t = w_{t-1} -\eta g_w$$

$$ \text{Momentum:}\quad v_t = \beta_1 v_{t-1} + (1-\beta_1)g_w $$

In the beginning of the iteration, v_0 = 0, so we amend it to be \hat{v}_t:

$$ \hat{v}_t = \frac{v_t}{1-\beta_1^t} $$

where \beta_1 assigns the weights between the previous value and the current gradient.

Replace the gradient g_w by the amended momentum term \hat{v}_t:

$$ w_t = w_{t-1} -\eta \hat{v}_t $$

P.S. Why do we amend v_t by dividing by 1-\beta_1^t?

Since v_t = \beta_1 v_{t-1} + (1-\beta_1)g_w, the recursion looks like a geometrically decaying sum; the difference is that w and g_w keep updating each period. Assume there is no more updating, in other words the model has converged, so g_w is constant. The sequence \{ v_t \}_{t=0}^T would then be:

$$ v_0 = 0 $$

$$ v_1 = (1-\beta_1)g_w $$

$$ v_2 = \beta_1 (1-\beta_1)g_w + (1-\beta_1) g_w $$

$$ v_3 = \beta_1^2(1-\beta_1)g_w + \beta_1 (1-\beta_1)g_w + (1-\beta_1)g_w $$

$$ v_n = \beta_1^{n-1}(1-\beta_1)g_w + \cdots + (1-\beta_1)g_w $$

$$ v_n = (1-\beta_1)g_w (1+\beta_1 + \cdots +\beta_1^{n-1}) $$

$$ v_n = (1-\beta_1)g_w \frac{1 - \beta_1^n}{1-\beta_1} = g_w (1-\beta_1^n)$$

Clearly, to make v_n behave like g_w, we need to divide it by 1-\beta_1^n, because of this geometric form.
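A quick numerical check of this argument (a sketch of my own): with a constant gradient, the raw EMA v_n only climbs slowly toward g_w, while the corrected v_n / (1-\beta_1^n) recovers g_w at every step.

beta1, g_w = 0.9, 5.0
v = 0.0
for n in range(1, 11):
    v = beta1 * v + (1 - beta1) * g_w          # raw momentum term
    print(n, round(v, 4), round(v / (1 - beta1**n), 4))
# the raw v approaches g_w = 5 only gradually; the corrected value equals 5 from the first step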

RMSprop

  • Apply a Transformation to the Gradient, \nabla.

$$\text{Gradient Descent:} \quad w_t = w_{t-1} -\eta g_w$$

$$ \text{RMS:}\quad m_t = \beta_2 m_{t-1} + (1-\beta_2)g^2_w $$

In the beginning of the iteration, m_0 = 0, so we amend it to be \hat{m}_t (remember there is a power t here).

$$ \hat{m}_t = \frac{m_t}{1-\beta_2^t} $$

where \beta_2 assigns the weights between the previous value and the squared gradient.

Add the square root of \hat{m}_t to the denominator (in practice a small \epsilon is added to avoid division by zero):

$$ w_t = w_{t-1} -\eta \frac{g_w}{\sqrt{\hat{m}_t}} $$

Adam

Adam is arguably the most widely used optimiser so far. It is a combination of Momentum and RMSprop: both \hat{v} and \hat{m} are used to adjust the speed of approaching the optimal points W^* and b^*.

$$ w_t = w_{t-1} - \eta \cdot \frac{ \hat{v}_t }{\sqrt{\hat{m}_t}} $$

The three main hyperparameters are the learning rate \eta (commonly around 0.001), \beta_1 = 0.9, and \beta_2 = 0.999.

Code
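The original attachment is not reproduced here; as a minimal sketch of my own, the following implements the four update rules above in NumPy and applies them to a toy quadratic objective (class and function names are illustrative).

import numpy as np

def sgd(w, g, lr=0.1):
    # plain (stochastic) gradient descent step
    return w - lr * g

class Momentum:
    def __init__(self, lr=0.1, beta1=0.9):
        self.lr, self.beta1, self.v, self.t = lr, beta1, 0.0, 0

    def update(self, w, g):
        self.t += 1
        self.v = self.beta1 * self.v + (1 - self.beta1) * g     # momentum term
        v_hat = self.v / (1 - self.beta1 ** self.t)              # bias correction
        return w - self.lr * v_hat

class RMSprop:
    def __init__(self, lr=0.1, beta2=0.999, eps=1e-8):
        self.lr, self.beta2, self.eps, self.m, self.t = lr, beta2, eps, 0.0, 0

    def update(self, w, g):
        self.t += 1
        self.m = self.beta2 * self.m + (1 - self.beta2) * g**2   # RMS term
        m_hat = self.m / (1 - self.beta2 ** self.t)
        return w - self.lr * g / (np.sqrt(m_hat) + self.eps)

class Adam:
    def __init__(self, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.v, self.m, self.t = 0.0, 0.0, 0

    def update(self, w, g):
        self.t += 1
        self.v = self.beta1 * self.v + (1 - self.beta1) * g      # momentum part
        self.m = self.beta2 * self.m + (1 - self.beta2) * g**2   # RMS part
        v_hat = self.v / (1 - self.beta1 ** self.t)
        m_hat = self.m / (1 - self.beta2 ** self.t)
        return w - self.lr * v_hat / (np.sqrt(m_hat) + self.eps)

# toy demo: minimise f(w) = (w - 3)^2, whose gradient is g = 2 (w - 3)
opt, w = Adam(), 0.0
for _ in range(200):
    w = opt.update(w, 2 * (w - 3))
print(w)   # approaches 3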

Deep Learning from Scratch

1. Perceptron

1.1 AND & OR & ‘N’ Gate

Linear classification can be implemented with the AND, OR and NAND gates, each built from a single perceptron.

import numpy as np

def AND(x1, x2):
    x = np.array([x1,x2])
    w = np.array([0.5, 0.5])
    b = -0.7
    temp = np.sum(w*x )+ b
    if temp<= 0:
        return 0
    else:
        return 1
    
def NAND(x1,x2):
    x = np.array([x1, x2])
    w = np.array([-0.5, -0.5])
    b = 0.7
    temp = np.sum(w*x)+b
    if temp <= 0:
        return 0
    else: 
        return 1

def Nand(x1, x2): # same as above
    return int(not bool(AND(x1,x2)))
    
    
def OR(x1,x2):
    x = np.array([x1, x2])
    w = np.array([0.5, 0.5])
    b = -0.2
    temp = np.sum(w*x) + b
    if temp <=0:
        return 0
    else:
        return 1

1.2 XOR Gate – Apply More than Two Perceptrons to Achieve Non-Linear Classification

def XOR(x1, x2):
    s1 = NAND(x1, x2)
    s2 = OR(x1,x2)
    y = AND(s1, s2)
    return y
  • It has been proved (the universal approximation theorem) that any continuous function can be approximated arbitrarily well by a two-layer network of perceptrons with sigmoid as the activation function.

By applying multiple layers of perceptrons, we can simulate any non-linear transformation. An important point is that an activation function must be inserted between the layers. It is easy to show that

$$ W_2\big(W_1 X + b_1 \big) + b_2 \equiv W^* X + b^*$$

A multi-layer stack of linear transformations is still a linear transformation, so multi-layer perceptrons are useless without activation functions.

However, if an activation function, in other words a non-linear transformation, is added between the layers, then the neural network can mimic any non-linear function.
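A quick numerical check of this identity (a sketch of my own, with random matrices): stacking two affine layers without an activation collapses to a single affine map with W^* = W_2 W_1 and b^* = W_2 b_1 + b_2.

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=(2, 1))
x = rng.normal(size=(3, 1))

two_layers = W2 @ (W1 @ x + b1) + b2
W_star, b_star = W2 @ W1, W2 @ b1 + b2                 # the collapsed single layer
print(np.allclose(two_layers, W_star @ x + b_star))    # True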

  • Therefore, what deep learning does is apply multi-layer perceptrons to mimic the pattern of a certain target. Some key points are
    • (1) multi-layer perceptrons are called a neural network. Each pair of adjacent layers performs a linear transformation WX + b and then applies an activation function, such as Sigmoid(x) = \frac{1}{1+e^{-x}};
    • (2) the weight matrix W and bias b are updated until the neural network outputs an “optimal” result.
  • (1) How to define the layers and network structure, (2) how to evaluate whether the result is “optimal”, and (3) how to find the weights are the main problems in deep learning.

2. Network Structure – Forward Propagation

Let’s start with the first question: how to define the layers and the network structure. Through the network,

  1. we input the data, X, initially.
  2. The data are transformed multiple times by perceptron (“Affine”) layers followed by activation functions; these are the multi-layers.
  3. Then the results are passed through a Softmax function, which reshapes them into probabilities.
  4. Finally, a loss is calculated.

$$ x \to f(.) \to a(.) \to \text{…} \to Softmax(.) \to Loss(.)$$

where f(x) = xW + b is the affine transformation, a(\cdot) is the activation function, and Loss(\cdot) is the loss function.

2.1 Affine – Linear Transformation

$$ f(X) = X \cdot W +b $$

2.2 Activation Function

  • ReLU, f(x) = \max(x, 0)
  • Sigmoid, f(x) = \frac{1}{1+e^{-x}}
  • tanh

Activation functions are usually expected to have certain properties, such as being (roughly) centred around zero and differentiable; those details are not discussed here.

2.3 Softmax

$$ \vec{X} = (x_1, \dots, x_i, \dots) $$

$$ Softmax(\vec{X}) = \vec{Y} = \bigg(\dots, \frac{e^{x_i}}{\sum_j e^{x_j}} ,\dots\bigg)^T $$

The inputs are transformed into probabilities, and those probabilities sum to one.

P.S.

$$ y_k = \frac{e^{a_k}}{\sum_{i=1}^{n}e^{a_i} } = \frac{C e^{a_k}}{C \sum_{i=1}^{n}e^{a_i} } $$

$$ = \frac{e^{a_k+ \ln C} }{\sum_{i=1}^{n}e^{a_i + \ln C} } $$

$$ = \frac{e^{a_k - C'} }{\sum_{i=1}^{n}e^{a_i - C'} } $$

To prevent overflow, we subtract the maximum value from x beforehand. The maximum entry is the one that matters; whether the other, much smaller, terms underflow is beyond our control and is not a concern.
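A minimal sketch (my own code) of the overflow-safe implementation described above:

import numpy as np

def softmax(x):
    # subtracting the maximum leaves the result unchanged, because the
    # constant cancels between numerator and denominator
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

a = np.array([1010.0, 1000.0, 990.0])    # np.exp(1010) on its own would overflow
print(softmax(a), softmax(a).sum())       # finite probabilities that sum to 1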

2.4 Loss Function

  • Cross Entropy, L = -\sum_i t_i \ \ln(y_i) = - \ln(\vec{Y}) \cdot \mathbf{t}^T
  • MSE, see the "Linear Regression Paddle" note. E = \frac{1}{2} \sum_k (y_k - t_k)^2

3. Minimise Loss & Update Weights

We consider the “optimal” weights to be those that yield the minimum loss. In other words, we aim to find the w and b that minimise the loss function.

How to Do It ?

$$ \arg\min_{w, b} Loss $$

$$ \hat{w}: \frac{\partial L}{\partial w} = 0 ,\quad \hat{b}: \frac{\partial L}{\partial b} = 0$$

Remember, the pathway:

$$ x \to f(.) \to a(.) \to \text{…} \to Softmax(.) \to Loss(.)$$

To solve the F.O.C., we need to apply the chain rule backward, which is called backward propagation (backpropagation).

By the chain rule,

$$ \frac{\partial Loss(w)}{\partial w} = \frac{\partial Loss(\cdot)}{\partial Softmax} \cdot \frac{\partial Softmax(\cdot)}{\partial a} \cdot \frac{\partial a}{\partial f} \cdots \frac{\partial f}{\partial w} $$

Let’s Decompose each Part in the Following.

3.1 Loss

$$ \frac{\partial Loss(\vec{Y}, \vec{t})}{\partial \vec{Y}} $$

  • Cross Entropy

$$ L = -\sum_i t_i \ \ln(y_i) = - \ln(\vec{Y}) \cdot \mathbf{t}^T$$

$$\frac{\partial L}{\partial y_i} = - \frac{t_i}{y_i}$$

$$\frac{\partial L}{\partial \vec{Y}} = - \bigg(\dots,\frac{t_i}{y_i},\dots\bigg)^T$$

3.2 Softmax

$$ \vec{X} = (x_1, \dots, x_i, \dots) $$

$$ Softmax(\vec{X}) = \vec{Y} = \bigg(\dots, \frac{e^{x_i}}{\sum_j e^{x_j}} ,\dots\bigg)^T $$

$$ \frac{\partial Softmax(\vec{X})}{\partial \vec{X}} = \mathrm{Diag}(\vec{Y}) - \vec{Y}\cdot \vec{Y}^T $$

The derivation is as follows (see also https://blog.csdn.net/Wild_Young/article/details/121912675).

$\frac{\partial Y}{\partial X}$:

Let X be a vector with shape (n,1), and $Y = Softmax(X)$.

That means, for each element y_k of Y,

$$ y_k = \frac{e^{x_k}}{\sum_{i=1}^N e^{x_i}} $$

So we compute the partial derivatives $\frac{\partial y_j}{\partial x_i}$ in two cases.

If j = i,

$$ \frac{\partial y_j}{\partial x_i} = \frac{\partial }{\partial x_i} \Bigg( \frac{e^{x_i}}{\sum e^{x_i}} \Bigg) $$

$$ = \frac{ (e^{x_i})'\sum e^{x_i} - e^{x_i} \big(\sum e^{x_i}\big)' }{\big( \sum e^{x_i} \big)^2} $$

$$ =\frac{e^{x_i}}{\sum e^{x_i}} - \frac{e^{x_i}}{\sum e^{x_i}} \cdot \frac{e^{x_i}}{\sum e^{x_i}}$$

$$= y_i - y_i^2 =y_i(1-y_i)$$

If j \neq i,

$$ \frac{\partial y_j}{\partial x_i} = \frac{\partial }{\partial x_i} \Bigg( \frac{e^{x_j}}{\sum e^{x_i}} \Bigg) $$

$$ = \frac{-e^{x_i} e^{x_j}}{(\sum e^{x_i})^2} $$

$$ =- \frac{e^{x_j}}{\sum e^{x_i}} \cdot \frac{e^{x_i}}{\sum e^{x_i}} $$

$$ = -y_j y_i $$

$$ \frac{\partial y_j}{\partial x_i}=y_i - y_i^2 \quad \text{,if $i = j$} $$


$$\frac{\partial y_j}{\partial x_i} = -y_j y_i \quad \text{,if $i \neq j$} $$

Therefore, for \frac{\partial Y}{\partial X} we get the Jacobian matrix

$$\frac{\partial \vec{Y}}{\partial \vec{X}} = \mathrm{Diag}(\vec{Y}) - \vec{Y}\,\vec{Y}^T$$

Its diagonal entries are the i = j case,

$$ \frac{\partial y_i}{\partial x_i} = y_i-y_i^2 =y_i(1-y_i)$$

and in the next subsection, combining Softmax with the cross-entropy loss collapses this Jacobian into a very simple backward signal.

3.3 Loss (Cross-Entropy) and Softmax

There is a trick in combining the cross-entropy error with the Softmax.

The cross-entropy loss function is

$$ L = -\sum_i t_i \ \ln(y_i) = - \vec{Y}_{\log}^{\,T} \cdot \vec{\mathbf{t}} $$

where \vec{Y}_{\log} applies the log to each element of \vec{Y} and ^T denotes transposition. (The matrix derivative can be checked at https://www.matrixcalculus.org.)

$$\frac{\partial L}{\partial \vec{Y}} = -\vec{\mathbf{t}}^T \cdot \mathrm{Diag}(Y_{1/Y}) $$

where Y_{1/Y} takes 1/y_i for each element of Y.

$$ \frac{\partial L}{\partial x_k } = \sum_i \frac{\partial L}{\partial y_i}\cdot \frac{\partial y_i}{\partial x_k} $$

$$\frac{\partial L}{\partial x_k } =- \sum_i \frac{\partial }{\partial {y}_i} \bigg( t_i \ \ln(y_i) \bigg) \cdot \frac{\partial y_i}{\partial x_k} $$

$$\frac{\partial L}{\partial x_k } =- \sum_i \frac{t_i }{{y}_i} \cdot \frac{\partial {y}_i}{\partial x_k} $$

Plug in the derivatives of Softmax, \frac{\partial y}{\partial x}, separating the i = k term from the i \neq k terms:

$$\frac{\partial L}{\partial x_k } = - \frac{t_k }{{y}_k} \cdot \frac{\partial {y}_k}{\partial x_k} - \sum_{i\neq k} \frac{t_i }{{y}_i} \cdot \frac{\partial {y}_i}{\partial x_k} $$

$$ = -\frac{t_k }{{y}_k} \cdot {y}_k (1-{y}_k) - \sum_{i\neq k} \frac{t_i }{{y}_i} \cdot (- {y}_i {y}_k) $$

$$ = - t_k (1-{y}_k) + \sum_{i\neq k} t_i {y}_k $$

$$ = - t_k + \sum_{i} t_i \ {y}_k $$

$$ = - t_k + {y}_k \sum_{i} t_i = y_k - t_k \quad\text{since the tags $t$ sum to 1} $$

$$ \frac{\partial L}{\partial x_k } ={y}_k - t_k$$

So we get \frac{\partial L}{\partial x_k} = {y}_k - t_k: the backward signal of the combined Softmax-with-Loss layer is simply y - t.
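A quick numerical check of this result (my own sketch), comparing y - t against a central-difference gradient of the combined Softmax-with-cross-entropy loss:

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def cross_entropy_after_softmax(x, t):
    return -np.sum(t * np.log(softmax(x)))

x = np.array([0.3, -1.2, 0.8])
t = np.array([0.0, 1.0, 0.0])                 # one-hot tag

analytic = softmax(x) - t                     # the y - t shortcut
h = 1e-6
numeric = np.array([
    (cross_entropy_after_softmax(x + h * np.eye(3)[i], t)
     - cross_entropy_after_softmax(x - h * np.eye(3)[i], t)) / (2 * h)
    for i in range(3)
])
print(np.allclose(analytic, numeric, atol=1e-6))   # True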

3.4 Affine

$$ \frac{\partial f}{\partial w} $$

$$ \frac{\partial f}{\partial b} $$

  • Backward: writing G := \frac{\partial L}{\partial f} for the upstream gradient,

$$ \frac{\partial L}{\partial X} = G \cdot W^T \quad \text{(right-multiplied by } W^T) $$

$$ \frac{\partial L}{\partial W} = X^T \cdot G \quad \text{(left-multiplied by } X^T) $$

$$ \frac{\partial L}{\partial b} = \mathbf{1}^T \cdot G \quad \text{(left-multiplied by } \mathbf{1}^T\text{, i.e., column sums)} $$
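Putting this into code, here is a minimal Affine layer with forward and backward passes (a sketch of my own; `G` denotes the upstream gradient \partial L / \partial f):

import numpy as np

class Affine:
    def __init__(self, W, b):
        self.W, self.b = W, b
        self.X = None

    def forward(self, X):
        self.X = X
        return X @ self.W + self.b        # f(X) = X W + b

    def backward(self, G):
        dX = G @ self.W.T                 # upstream gradient right-multiplied by W^T
        self.dW = self.X.T @ G            # X^T left-multiplied
        self.db = np.sum(G, axis=0)       # 1^T left-multiplied (column sums)
        return dX

# shape check
layer = Affine(np.ones((3, 2)), np.zeros(2))
out = layer.forward(np.ones((5, 3)))
dX = layer.backward(np.ones((5, 2)))
print(out.shape, dX.shape, layer.dW.shape, layer.db.shape)   # (5, 2) (5, 3) (3, 2) (2,)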

4. Other Theoretical Details

  1. Batch: mini-batch and epoch, for improving training efficiency and accuracy.
  2. Weights initialisation: Xavier for Sigmoid and tanh, He for ReLU.
  3. Weights updating: SGD, Momentum, Adam, etc.
  4. Overfitting: batches, regularisation, dropout.
  5. Hyperparameters: Bayesian optimisation.

See Chapter 6 of the book.

5. Code

See the attachment for the code realisation in pure NumPy.

Code and Books: http://82.157.170.111:1011/s/9D9BCBbCop6ERaD

Duality

$$\min\ f_0(x), \quad x \in \mathbb{R}^n$$

$$\text{s.t. } f_i(x)\leq 0, \quad i = 1, \dots, m$$

$$ \quad\ \ h_i(x)=0, \quad i = 1, \dots, q$$

  • That is, in Lagrangian form,

$$L(x,\lambda,\gamma)=f_0(x)+\sum \lambda_i f_i (x) +\sum \gamma_i h_i(x)$$

$$ \min_{x} \max_{\lambda,\gamma} L(x,\lambda, \gamma) $$

$$s.t. \lambda \geq 0$$

  • The Duality Problem is,

$$g(\lambda, \gamma) = \min_{x} L(x,\lambda, \gamma)$$

$$\max_{\lambda, \gamma} g(\lambda, \gamma) $$

$$s.t. \nabla_x L(x,\lambda, \gamma)=0$$

$$\quad \lambda \geq 0$$
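As a small worked example (my own, not from the original notes), take f_0(x) = x^2 with a single constraint f_1(x) = 1 - x \leq 0 and no equality constraints. The Lagrangian and dual function are

$$ L(x,\lambda) = x^2 + \lambda(1-x), \qquad g(\lambda) = \min_{x} L(x,\lambda) = \lambda - \frac{\lambda^2}{4} \quad (\text{attained at } x = \lambda/2) $$

Maximising g over \lambda \geq 0 gives \lambda^* = 2 and g(\lambda^*) = 1, which equals the primal optimum f_0(x^*) = 1 at x^* = 1; strong duality holds for this convex problem.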

Why Duality?

We convert the original problem into its dual, which becomes a convex optimisation problem.

Convex Optimisation

  • The objective function is convex (or the negative of a concave function).
  • The feasible set is a convex set.

See further study.

Volatility Forecast – Ornstein Uhlenbeck Process

See notes and the realisation.

We assume the volatility of returns follows a stochastic process. Define it as the Ornstein-Uhlenbeck process,

$$dX_t = \kappa (\theta – X_t)dt +\sigma \ dW_t$$

What’s the distribution function of the Ornstein-Uhlenbeck process?

We can apply maximum likelihood estimation (MLE) to find the parameters of the above random process.

Recall our Vasicek Form Ornstein-Uhlenbeck process is like the following:

$$ dX_t = \kappa (\theta – X_t)dt +\sigma \ dW_t $$

Multiply both sides by e^{kt} (writing k for \kappa from here on); then we get

$$ e^{kt}dX_t = k \theta e^{kt} \ dt - k e^{kt} X_t \ dt +\sigma e^{kt}\ dW_t $$

We know that d( e^{kt} X_t )=e^{kt}dX_t + k e^{kt}X_t dt, and substituting this in we get

$$ d(e^{kt}X_t)= k\theta e^{kt} \ dt + \sigma e^{kt} \ dW_t $$

Take an integral over [0,T],

$$ \int_0^T d(e^{kt}X_t)= \int_0^T k\theta e^{kt} \ dt + \int_0^T \sigma e^{kt} \ dW_t $$

$$ X_T = X_0 e^{-kT} +\theta (1-e^{-kT}) + \int_0^T e^{-k(T-t)}\sigma \ dW_t $$

$\int_0^T e^{-k(T-t)}\sigma \ dW_t \sim N(0, \sigma^2\int_0^T e^{-2k(T-t)}dt)$

We then find \mathbb{E}(X_T) and Var(X_T).

$ \mathbb{E}(X_T) =\mathbb{E}\bigg( X_0 e^{-kT} +\theta (1-e^{-kT}) + \int_0^T e^{-k(T-t)}\sigma \ dW_t \bigg)$

$ \mathbb{E}(X_T)= X_0 e^{-kT} +\theta (1-e^{-kT}) $

$Var(X_T) = \mathbb{E}\bigg( \big( X_T - \mathbb{E}(X_T) \big)^2 \bigg)$

$Var(X_T)= \mathbb{E}\bigg( \big( \int_0^T e^{-k(T-t)}\sigma \ dW_t \big)^2 \bigg) $

By Ito’s Isometry: I(t)=\int_0^t \Delta(s)dW_s, then

$Var[I(t)]=\mathbb{E}[(I^2(t))]=\int_0^t \Delta^2(s)ds$

Then,

$ Var(X_T)= \int_0^T e^{-2k(T-t)}\sigma^2 \ dt $

$ Var(X_T)= \frac{\sigma^2}{2k}\big( 1-e^{-2kT} \big) $

Therefore, we finally get,

$ X_T \sim N\bigg(X_0 e^{-kT} +\theta (1-e^{-kT}), \frac{\sigma^2}{2k}\big( 1-e^{-2kT} \big) \bigg)$

MLE of the Ornstein-Uhlenbeck Process

$X_{t+\delta t} \mid X_t \sim N(\cdot,\cdot)$

where \mathbb{E}(X_{t+\delta t} \mid X_t)=X_t e^{-k\ \delta t} +\theta (1-e^{-k\ \delta t}), and Var(X_{t+\delta t} \mid X_t)= \frac{\sigma^2}{2k}(1-e^{-2k\ \delta t}).

$$ f_{\theta}(x_{t+\delta t}|\theta)=\frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}} $$

We substitute \mu = \mathbb{E}(X_{t+\delta t} \mid X_t) and \sigma^2 = Var(X_{t+\delta t} \mid X_t) into the density and maximise the resulting log-likelihood over the observed increments.

Code realisation could be found in notes.
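Since the code lives in separate notes, here is a minimal sketch of my own (parameter names and values are made up for illustration): simulate the process with the exact transition density derived above, then recover (\kappa, \theta, \sigma) by maximising the Gaussian log-likelihood with SciPy.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

def simulate_ou(kappa, theta, sigma, x0, dt, n):
    # exact discretisation: X_{t+dt} | X_t is Gaussian with the mean and variance above
    x = np.empty(n + 1); x[0] = x0
    decay = np.exp(-kappa * dt)
    std = np.sqrt(sigma**2 / (2 * kappa) * (1 - np.exp(-2 * kappa * dt)))
    for i in range(n):
        x[i + 1] = x[i] * decay + theta * (1 - decay) + std * rng.normal()
    return x

def neg_log_lik(params, x, dt):
    kappa, theta, sigma = params
    if kappa <= 0 or sigma <= 0:
        return np.inf
    mean = x[:-1] * np.exp(-kappa * dt) + theta * (1 - np.exp(-kappa * dt))
    var = sigma**2 / (2 * kappa) * (1 - np.exp(-2 * kappa * dt))
    resid = x[1:] - mean
    return 0.5 * np.sum(np.log(2 * np.pi * var) + resid**2 / var)

x = simulate_ou(kappa=2.0, theta=0.2, sigma=0.3, x0=0.5, dt=1/252, n=5000)
fit = minimize(neg_log_lik, x0=[1.0, 0.1, 0.2], args=(x, 1/252), method="Nelder-Mead")
print(fit.x)   # roughly recovers (2.0, 0.2, 0.3); kappa is estimated with the most noise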