Optimiser

In the Neural Network, we optimise (minimise) the loss by choosing weights matrix, W and b.

$$ arg\max_{W, b} Loss $$

By taking the F.O.C., we get the gradient, \nabla. Then, we update weights by minus the learning rate times the gradient.

Here, I wrote some test algorithms to find how optimisers evolve.

Preparation

Let’s make some preparations and notation clarifications.

Notation:

  • Y is True,
  • T is the Estimate.

$$Y = XW + b + e$$

Let T = WX +b

$$ loss = e^T e $$

$$ loss = (Y-T)^T \cdot (Y-T) $$

$$= (WX+b+e-T)^T \cdot (WX+b+e-T) $$

$$ \frac{\partial Loss}{\partial W} = X^T (WX+b+e-T) = 2X^T (Y-T) $$

$$ \frac{\partial loss }{\partial b} = 1^T (WX+b+e-T) = 1^T (Y-T)$$

Just One More Thing We Consider Here!

We do not want the number of sample to affect loss function’s value, so we take averge, instead of sumation.

$$ \frac{\partial Loss}{\partial W} = \frac{2}{N}\ X^T (Y-T) \quad, \quad \frac{\partial loss }{\partial b} = \frac{1}{N} \ 1^T (Y-T) $$

Optimisers

Gradient Descent

$$ W_t = W_{t-1} – \eta \nabla_W $$

$$ b_t = b_{t-1} – \eta \nabla_b $$

, where \eta is the learning rate.

However, we may find that the Gradient Descent might not satisfactory, because gradients might not be available or vanished sometimes. Also, the process of gradient descent largely depends on the HyperParameter \eta. Some other methods of optimisation are developed.

Different Optimisers aim to find the optimal weights more speedy and more accurate. In the followings are how optimisers are designed.

Stochastic Gradient Descent – SGD

In SGD, the stochastic part is added to avoid model train full dataset, and avoid resulting in the problem of overfitting.

Momentum SGD

A momentum term is added.

  • Apply Momentum to the Gradient, \nabla.

$$\text{Gradient Descent:} \quad w_t = w_{t-1} -\eta g_w$$

$$ \text{Momentum:}\quad v_t = \beta_1 v_{t-1} + (1-\beta_1)g_w $$

In the Beginning of Iteration, v_0=0, so we amend it to be \hat{v_T}

$$ \hat{v_t} = \frac{v_t}{1-\beta_1^t} $$

,where \beta_1 assign weights between previous value and the gradient.

Replace the gradient g_w by the amended momentum term \hat{v}_t:

$$ w_t = w_{t-1} -\eta \hat{v}_t $$

P.S. Why we amend v_t by dividing 1 + \beta^t.

As v_t = \beta_1 v_{t-1} + (1-\beta_1)g_w, which is like a geometric decaying polynomial function. The difference is that w and g_w keep updating each period. Let’s assume there is no updating anymore, or in other word, model has converge. g_w = constant. \{ v_t \}_{t=0}^T would be,

$$ v_0 = 0 $$

$$ v_1 = (1-\beta_1)g_w $$

$$ v_2 = \beta_1 (1-\beta_1)g_w + (1-\beta_1) g_w $$

$$ v_3 = \beta_1^2(1-\beta_1)g_w + \beta_1 (1-\beta_1)g_w + (1-\beta_1)g_w $$

$$ v_n = \beta_1^{n-1}(1-\beta_1)g_w + … + (1-\beta_1)g_w $$

$$ v_n = (1-\beta_1)g_w (1+\beta_1 + … +\beta_1^{n-1}) $$

$$ v_n = (1-\beta_1)g_w \frac{1 – \beta_1^n}{1-\beta_1} =g_w (1-\beta_1^n)$$

Clearly, to amend v_n be like g_w, we need to divide it by 1-\beta_1^n, because of the polynomial ‘geometric’ form.

RMSprop

  • Apply a Transformation to the Gradient, \nabla.

$$\text{Gradient Descent:} \quad w_t = w_{t-1} -\eta g_w$$

$$ \text{RMS:}\quad m_t = \beta_1 v_{t-1} + (1-\beta_2)g^2_w $$

In the Beginning of Iteration, v_0=0, so we amend it to be \hat{v_T}, (remember there is a power t here).

$$ \hat{m_t} = \frac{m_t}{1-\beta_1^t} $$

,where \beta_1 assign weights between previous value and the gradient.

Add a square root of \hat{m}_t in the denominator:

$$ w_t = w_{t-1} -\eta \frac{g_w}{\sqrt{\hat{m}_t}} $$

Adam

Adam is nearly the most powerful optimiser so far. Adam is like a combination of Momentum and RMS, both \hat{m} and \hat{v} are added to adjust the speed of approaching to the optimal points, W^*, and b^*.

$$ w_t = w_{t-1} – \eta \cdot \frac{ \hat{v}_t }{\sqrt{\hat{m}_t}} $$

Three main hyper parameters are \eta = 0.01, \beta_1 = 0.9, and \beta_2 = 0.999.

Code

Deep Learning from Scratch

1. Perceptron

1.1 AND & OR & ‘N’ Gate

Linear Classification might be applied by AND & OR.

import numpy as np

def AND(x1, x2):
    x = np.array([x1,x2])
    w = np.array([0.5, 0.5])
    b = -0.7
    temp = np.sum(w*x )+ b
    if temp<= 0:
        return 0
    else:
        return 1
    
def NAND(x1,x2):
    x = np.array([x1, x2])
    w = np.array([-0.5, -0.5])
    b = 0.7
    temp = np.sum(w*x)+b
    if temp <= 0:
        return 0
    else: 
        return 1

def Nand(x1, x2): # same as above
    return int(not bool(AND(x1,x2)))
    
    
def OR(x1,x2):
    x = np.array([x1, x2])
    w = np.array([0.5, 0.5])
    b = -0.2
    temp = np.sum(w*x) + b
    if temp <=0:
        return 0
    else:
        return 1

1.2 XOR Gate – Apply More than Two Perceptrons to Achieve Non-Linear Classification

def XOR(x1, x2):
    s1 = NAND(x1, x2)
    s2 = OR(x1,x2)
    y = AND(s1, s2)
    return y
  • It has been proved that any functions could be represented by a combination of 2 perceptrons with sigmoid as the activation function.

Apply Multi – Layers of Perceptrons, we can achieve the simulation of any non-linear transformation. An important part is that Activation Function is necessary to be inserted into different layers. It can be easily proved that:

$$ W_2\Bigg( \big(W_1 X + b_1 \big) \Bigg) + b_2 \equiv W^* x + b^*$$

Multi-layer of Linear Transformation is still Linear Transformation. So, multi-layer of perceptrons becomes useless without Activation Function.

However, if activation function, in other word a non-linear transformation, is added between layers, then the neural network could mimic any non-linear function.

  • Therefore, what the Deep Learning is doing is to apply multi-layer perceptrons to mimic the pattern of a certain thing. Some key points are
    • (1) multi-layer perceptrons are named the Neural Network. Each adjacent layers are doing linear transformation WX + b and then apply a activation function, such as Sigmoid(x) = \frac{1}{1+e^{-x}};
    • (2) Update the weight matrix W and b, until the neural network could output an “optimal” result.
  • (1) How to define the layers and network structure, (2) how to evaluate the result is “Optimal” or not, and (3) how to find the weights are the main problems in Deep Learning.

2. Network Structure – Forward Propagation

Let’s start with the first Question. How to define the layers and network structure. Through the Network,

  1. we input Data, X, initially.
  2. Data are transformed by ( Perceptrons – “Affine”, Activation Functions ) multi-times, which are the multi-layers.
  3. Then, the results are passed through a Softmax function to reform results into percentages.
  4. Finally, a Loss is calculate.

$$ x \to f(.) \to a(.) \to \text{…} \to Softmax(.) \to Loss(.)$$

, where f(x) = xw + b

, where a(x) is the activation function.

, where Loss(x) is the loss function.

2.1 Affine – Linear Transformation

$$ f(X) = X \cdot W +b $$

2.2 Activation Function

  • ReLU, f(x) = max(x, 0
  • Sigmoid, f(x) = \frac{1}{1+e^{-x}}
  • tanh

There are some properties of activation function, such as symmetric to the zero point, differentiable, etc. Those details are not discuss here.

2.3 Softmax

$$ \vec{X} = (x_1, …, x_i, ..) $$

$$ Softmax(\vec{X}) = \vec{Y} = (…, \frac{e^{x_i}}{\sum e^{x_i}} ,…)^T $$

Inputs are reform to be percentages, and those percentages are sum to be one.

P.S.

$$ y_k = \frac{e^{a_k}}{\sum_{i=1}^{n}e^{a_i} } = \frac{C e^{a_k}}{C \sum_{i=1}^{n}e^{a_i} } $$

$$ = \frac{e^{a_k+ lnC} }{\sum_{i=1}^{n}e^{a_i + lnC} } $$

$$ = \frac{e^{a_k – C’} }{\sum_{i=1}^{n}e^{a_i – C’} } $$

为了防止溢出,事先把x减去最大值。最大值是有效数据,其他值溢不溢出可管不了,也不关心。

2.4 Loss Function

  • Cross Entropy, L = -\sum t_i \ ln(y_i) = – ln(\vec{Y}) \cdot \mathbf{t}^T
  • MSE, See “Linear Regression Paddle” note. E = \frac{1}{2} \sum_k (y_k – t_k)^2

3. Minimise Loss & Update Weights

We consider the “Optimal” weights are such that they can result in minimum loss. So, in other words, we aim to find w and b that can minimise the loss function.

How to Do It ?

$$ arg\min_{w, b} Loss $$

$$ \hat{w}: \frac{\partial L}{\partial w} = 0 ,\quad \hat{b}: \frac{\partial L}{\partial b} = 0$$

Remember, the pathway:

$$ x \to f(.) \to a(.) \to \text{…} \to Softmax(.) \to Loss(.)$$

To solve the `F.O.C, we need to apply the Chain rule backward, which is called the Backward Propagation.

By Chain Rule,

$$ \frac{\partial Loss(w)}{\partial w} = \frac{\partial Loss(.)}{\partial Softmax} \cdot \frac{\partial Softmax(.)}{\partial a} \cdot \frac{\partial a}{\partial f} \cdot \frac{\partial f}{\partial a} … \cdot \frac{\partial f}{\partial w} $$

Let’s Decompose each Part in the Following.

3.1 Loss

$$ \frac{\partial Loss(\vec{Y}, \vec{t})}{\partial \vec{Y}} $$

  • Cross Entropy

$$ L = -\sum t_i \ ln(y_i) = – ln(\vec{Y}) \cdot \mathbf{t}^T$$

$$\frac{\partial L}{\partial y_i} = – \frac{t_i}{y_i}$$

$$\frac{\partial L}{\partial \vec{Y}} = – \big(…,\frac{t_i}{y_i},…\big)^T$$

3.2 Softmax

$$ \vec{X} = (x_1, …, x_i, ..) $$

$$ Softmax(\vec{X}) = \vec{Y} = (…, \frac{e^{x_i}}{\sum e^{x_i}} ,…)^T $$

$$ \frac{\partial Softmax(\vec{X})}{\partial \vec{X}} = Diag(\vec{Y}) – Y\cdot Y^T $$


Deviation is as the Following,

https://blog.csdn.net/Wild_Young/article/details/121912675

$\frac{\partial Y}{\partial X}$:

$$Softmax(x) = \frac{1}{1+e^{-x}}$$

Let X be a vector with shape = (n,1), and
$Y = Softmax(X) $.

That means, for each element of y_i in Y

$$ y_k = \frac{-e^{x_k}}{\sum_i^N e^{x_i}} $$

So,

$$ \frac{\partial Y}{\partial x_k} $$

If j = i,

$$ \frac{\partial y_j}{\partial x_i} = \frac{\partial }{\partial x_i} \Bigg( \frac{e^{x_i}}{\sum e^{x_i}} \Bigg) $$

$$ = \frac{ (e^{x_i})’\sum e^{x_i} – e^{x_i} (\sum e^{x_i})’ }{\big( \sum e^{x_i} \big)^2} $$

$$ =\frac{e^{x_i}}{\sum e^{x_i}} – \frac{e^{x_i}}{\sum e^{x_i}} \cdot \frac{e^{x_i}}{\sum e^{x_i}}$$

$$= y_i – y_i^2 =y_i(1-y_i)$$

If j \neq i,

$$ \frac{\partial y_j}{\partial x_i} = \frac{\partial }{\partial x_i} \Bigg( \frac{e^{x_j}}{\sum e^{x_i}} \Bigg) $$

$$ = \frac{-e^{x_i} e^{x_j}}{(\sum e^{x_i})^2} $$

$$ =- \frac{e^{x_j}}{\sum e^{x_i}} \cdot \frac{e^{x_i}}{\sum e^{x_i}} $$

$$ = -y_j y_i $$

$$ \frac{\partial y_j}{\partial x_i}=y_i – y_i^2 \quad \text{,if $i = j$} $$


$$\frac{\partial y_j}{\partial x_i} = -y_j y_i \quad \text{,if $i \neq j$} $$

Therefore, for \frac{\partial Y}{\partial X}, we got the Jacobian matrix,

$$\frac{\partial \vec{Y}}{\partial \vec{X}} = Diag(Y) – Y^T Y$$

However, in the backward propagation of Softmax, we only need the diagonal, which is in the case of i = j.

$$ \frac{\partial y_i}{\partial x_i} = y_i-y_i^2 =y_i(1-y_i)$$

3.3 Loss (Cross-Entropy) and Softmax

There is a trick with the combination between cross-entropy-error and softmax.

The Cross-Entropy Loss function is,

, where \vec{Y_{log}}^T is apply log to each element of \vec{Y}, and then take transformation.

https://www.matrixcalculus.org

$$\frac{\partial L}{\partial \vec{Y}} = \vec{\mathbf{t}}^T \cdot Diag(Y_{1/Y}) $$

, where Y_{1/Y} is 1/y_i for each element of Y.

$$ \frac{\partial L}{\partial x_k } = \frac{\partial L}{\partial y_k}\cdot \frac{\partial y_k}{\partial x_k} $$

$$\frac{\partial L}{\partial x_k } =- \sum_i \frac{\partial }{\partial {y}_k} \bigg( t_i \ ln(y_i) \bigg) \cdot \frac{\partial y_k}{\partial x_k} $$

$$\frac{\partial L}{\partial x_k } =- \sum_i \frac{t_i }{{ y}_i} \cdot \frac{\partial {y}_i}{\partial x_k} $$

Plug in the derivatives of Softmax, \frac{\partial y}{\partial x}

$$\frac{\partial L}{\partial x_k } = – \frac{t_k }{{ y}_k}
\cdot \frac{\partial {y}_k}{\partial x_k} \sum_{i\neq k} \frac{t_i }{{ y}_i} \cdot \frac{\partial {y}_i}{\partial x_k} $$

$$ = -\frac{t_k }{{ y}_k}
\cdot {y}_k (1-{y}_k) \sum_{i\neq k} \frac{t_i }{{ y}_i} \cdot (- {y}_i {y}_k)
$$

$$ = – t_k (1-{y}_k) \sum_{i\neq k} t_i {y}_k
$$

$$ = – t_k \sum_{i} t_i \ {y}_k
$$

$$ = – t_k {y}k \sum{i} t_i \quad\text{by the tag, $t$, is sum to be 1}
$$

$$ \frac{\partial L}{\partial x_k } ={y}_k – t_k$$

So, we get \frac{\partial L}{\partial x_k} ={y}_k – t_k.

3.4 Affine

$$ \frac{\partial f}{\partial w} $$

$$ \frac{\partial f}{\partial b} $$

  • Backward

$$ \frac{\partial f}{\partial X} =\cdot \ W^T \quad \text{, Right Multiplied}$$

$$ \frac{\partial f}{\partial W} = X^T \cdot \ \quad \text{, Left Multiplied}$$

$$ \frac{\partial f}{\partial b} = \mathbf{1}^T \cdot \quad \text{, Left Multiplied} $$

4. Other Theoretical Details

  1. Batch: mini-batch and epoch for improving training accuracy.
  2. Weights Initialisation: Xavier – Sigmoid and tank, He – ReLU
  3. Weights Updating: SGD, Momentum, Adam, etc.
  4. Overfitting: Batches, Regularisation, Dropout.
  5. Hyperparameters: Bayes

See Chapter 6 of the book.

5. Code

See Attachment for the Code Realisation by purely Numpy.

Code and Books: http://82.157.170.111:1011/s/9D9BCBbCop6ERaD

Duality

$$min\ f_0(x), x \in \mathbb{R}^n$$

$$s.t. f_i(x)\leq 0, \text{for i from 1 to m}$$

$$ \quad h_i(x)=0, \text{for i from 1 to q}$$

  • That is, in Lagrangian form,

$$L(x,\lambda,\gamma)=f_0((x)+\sum \lambda_i f_i (x) +\sum \gamma_i h_i(x)$$

$$ \min_{x} \max_{\lambda,\gamma} L(x,\lambda, \gamma) $$

$$s.t. \lambda \geq 0$$

  • The Duality Problem is,

$$g(\lambda, \gamma) = \min_{x} L(x,\lambda, \gamma)$$

$$\max_{\lambda, \gamma} g(\lambda, \gamma) $$

$$s.t. \nabla_x L(x,\lambda, \gamma)=0$$

$$\quad \lambda \geq 0$$

Why Duality?

We change the original problem in to the duality, which becomes a convexity optimisation problem.

Convexity Optimisation

  • The object function is convex. (or the negative of a concavity)
  • The feasible set is a convex set.

See further study.

KKT for Optimisation

Optimisation

Suppose f, g_1, g_2, …, g_m are differentiable.

$$ \max_{x_1, x_2,…, x_n} f(x_1, x_2, …, x_n) $$

$$s.t. g_1(x_1, x_2, …, x_n)\geq 0 $$

$$\quad g_2(x_1, x_2, …, x_n)\geq 0 $$

$$\quad ……$$

$$\quad g_m(x_1, x_2, …, x_n)\geq 0 $$

There are m constraints g_i, i \in (1,m), and one object function f to maximum. n variables.

KKT Conditions

Suppose the constraint qualification is satisfied. A local maximiser of the above problem \bar{x} = (\bar{x_1}, \bar{x_2},… \bar{x_n}) together with some \bar{lambda_1}, \bar{lambda_2}, …, \bar{lambda_m} \geq 0 satisfies the following:

$$ \frac{\partial f}{\partial x_i}(\bar{x}) + \lambda_1 \frac{\partial g_1}{\partial x_i}(\bar{x})+ \lambda_2 \frac{\partial g_2}{\partial x_i}(\bar{x}) +…+ \lambda_{m} \frac{\partial g_m}{\partial x_i}(\bar{x}) = 0$$

for i=1,2,…,n, and

$$\bar{\lambda_k} g_k(\bar{x})=0$$

for k=1,2,…,m

  • We have state “Suppose the constraint qualification is satisfied” in the beginning. What is the qualification?

See step 4 in the next section.

Solve the Maximisation Problem

  • Step 1. Step up the Lagrangian
  • Step 2. Write down the KKT conditions and constraints.
  • Step 3. Solve the equation set. (P.S. Discuss the complementary slackness, and confirm the possible solutions)
  • Step 4. Check constraint qualification.
  • Step 4.

Constraint Set, C=\{ (x_1, x_2, …, x_n)\in \mathbb{R}^n | g_k(x_1, x_2, …, x_m) \geq 0 \} for k=1,2,…,m.

The constraint qualification is satisfied at \bar{x} in C, if the constraints that hold at \bar{x} with equality are independent. That is, if the m vectors

$$ \nabla g_k(\bar{x})=\bigg( \frac{\partial g_k}{\partial x_1}(\bar{x}), \frac{\partial g_k}{\partial x_2}(\bar{x}),…,\frac{\partial g_k}{\partial x_n}(\bar{x}) \bigg)’ $$

for k=1,2,…,m are independent.

Black and Scholes, 1973

See A bit Stochastic Calculus .

For,

$$ dS_t = \mu S_t \ dt +\sigma S_t \ dW_t $$

  • In calculating d f(S_t), we would get, (by Taylor Expansion)

$$ df(S_t) = \bigg( \frac{\partial f}{\partial t} + \frac{\partial f}{\partial S_t}\mu S_t +\frac{1}{2}\frac{\partial^2 f}{\partial S_t^2}\sigma^2 S_t^2 \bigg)dt + \frac{\partial f}{\partial S_t}\sigma S_t \ dW_t $$

  • A Special Form of f(\cdot) is f(S) = log(S),

$$ d\ log(S_t) = \bigg( \mu – \frac{1}{2}\sigma^2 \bigg)dt + \sigma \ dW_t $$


We get Y_t = log S_t is the price of a derivative security with respect to S_t and t and then,

$$ dY_t= \bigg( \frac{\partial Y_t}{\partial t} + \frac{\partial Y_t}{\partial S_t}\mu S_t +\frac{1}{2}\frac{\partial^2 Y_t}{\partial S_t^2}\sigma^2 S_t^2 \bigg)dt + \frac{\partial Y_t}{\partial S_t}\sigma S_t \ dW_t $$

Consider a portfolio \Pi is constructed with (1) short one derivative, and (2) long some fraction of stocks, \Delta, such that the portfolio is risk natural. (\Delta = \frac{\partial Y}{\partial S})

$$ \Pi_t = -Y +\Delta \ S_t $$

Differentiate it,

$$ d\Pi_t = -dY_t +\frac{dY}{dS}dS_t $$

Subtitute dY_t and dS_t into the above equation, we would then get the stochastic process of portfolio, by Ito’s Lemma.

$$ d\Pi_t =-\bigg( \frac{\partial Y}{\partial t} + \frac{1}{2}\frac{\partial^2 Y}{\partial S^2} \sigma^2 S_t^2 \bigg) dt $$

The diffusion term dW_t disappears, and that means the portfolio is riskless during the interval dt. Under a no arbitrage assumption, this portfolio can only earn the riskless return, r.

$$ d\Pi_t =r\Pi_t \ dt $$

  • Subtitute d\Pi_t and \Pi_t into, we would get the Partial Differential Equation (PDE) / Black-Scholes equation,

$$ – \bigg( \frac{\partial Y}{\partial t} + \frac{1}{2}\frac{\partial^2 Y}{\partial S^2} \sigma^2 S_t^2 \bigg) dt = r\bigg(- Y_t + \frac{\partial Y}{\partial S}S_t \bigg)dt $$

$$ rY_t = \frac{\partial Y}{\partial t} + \frac{1}{2}\frac{\partial^2 Y}{\partial S^2} \sigma^2 S_t^2 + \frac{\partial Y}{\partial S}S_t $$

Then, we guess (where U(.) is a function of S_t at time t=T),

$$ Y_t = e^{-r(T-t)} U(S_T) $$

For a European call with strike price, K, U(S_T) would be the payoff at maturity,

$$ U(S_T) = max( S – K , 0 ) $$

Finally, through a series of process to find a specific solution of the PDE, we can solve the value of call, (\Phi(\cdot) is the cumulative standard normal distribtion)

$$ c = S_t\Phi(d_1) – K e^{-r (T-t)}\Phi(d_2) $$

with,

$$ d_1 =\frac{ log(S_t/K) + (r-\frac{1}{2}\sigma^2)(T-t)}{\sigma \sqrt{T-t}} $$

$$ d_1 = \sigma \sqrt{T-t} $$

Volatility Forecast – Ornstein Uhlenbeck Process

See notes and the realisation.

We assume the volatility of returns follows a stochastic process. Define it as the Ornstein-Uhlenbeck process,

$$dX_t = \kappa (\theta – X_t)dt +\sigma \ dW_t$$

What’s the distribution function of the Ornstein-Uhlenbeck process?

We could apply the MLE estimation to find the parameter of the above random process.

Recall our Vasicek Form Ornstein-Uhlenbeck process is like the following:

$$ dX_t = \kappa (\theta – X_t)dt +\sigma \ dW_t $$

Multiply both sides by e^{kt}, then we get,

$$ e^{kt}dX_t = \kappa e^{kt} \theta \ dt – \kappa e^{kt} X_t \ dt +\sigma e^{kt}\ dW_t $$

We know that d( e^{kt} X_t )=e^{kt}dX_t + k e^{kt}X_t dt, and substitute it inside. Then, we get,

$$ d(e^{kt}X_t)=e^{kt}\theta \ dt + e^{kt}\sigma \ dW_t $$

Take an integral from [0,T],

$$ \int_0^T d(e^{kt}X_t)= \int_0^T e^{kt}\theta \ dt + \int_0^T e^{kt}\sigma \ dW_t $$

$$ X_T = X_0 e^{-kT} +\theta (1-e^{-kT}) + \int_0^T e^{-k(T-t)}\sigma \ dW_t $$

$\int_0^T e^{-k(T-t)}\sigma \ dW_t \sim N(0, \sigma^2\int_0^T e^{-2k(T-t)}dt)$

We then find \mathbb{E}(X_T) and Var(X_T).

$ \mathbb{E}(X_T) =\mathbb{E}\bigg( X_0 e^{-kT} +\theta (1-e^{-kT}) + \int_0^T e^{-k(T-t)}\sigma \ dW_t \bigg)$

$ \mathbb{E}(X_T)= X_0 e^{-kT} +\theta (1-e^{-kT}) $

$Var(X_T) = \mathbb{E}\bigg( \big( X_T – \mathbb{E}(X_T) \big)^2 \bigg)$

$Var(X_T)= \mathbb{E}\bigg( \big( \int_0^T e^{-k(T-t)}\sigma \ dW_t \big)^2 \bigg) $

By Ito’s Isometry: I(t)=\int_0^t \Delta(s)dW_s, then

$Var[I(t)]=\mathbb{E}[(I^2(t))]=\int_0^t \Delta^2(s)ds$

Then,

$ Var(X_T)= \int_0^T e^{-2k(T-t)}\sigma^2 \ dt $

$ Var(X_T)= \frac{\sigma^2}{2k}\big( 1-e^{-2kT} \big) $

Therefore, we finally get,

$ X_T \sim N\bigg(X_0 e^{-kT} +\theta (1-e^{-kT}), \frac{\sigma^2}{2k}\big( 1-e^{-2kT} \big) \bigg)$

MLE of Ornsterin-Uhlenbeck Process

$X_t \sim N(\cdot,\cdot)$

,where \mathbb{E}(X_{t+\delta t})=X_0 e^{-k\ \delta t} +\theta (1-e^{-k\ \delta t}), and Var(X_{t+\delta t})= \frac{\sigma^2}{2k}(1-e^{-2k\ \delta t}).

$$ f_{\theta}(x_{t+\delta t}|\theta)=\frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}} $$

We replace \mu = \mathbb{E}(X_{t+\delta t}) and \sigma^2 = Var(X_{t+\delta t}) insider.

Code realisation could be found in notes.

Maximum Likelihood Estimator

Joint Probability Density Function or Likelihood Function:

$f(x_1, x_2, …, x_n|\theta) = f(x_1|\theta) f(x_2|\theta) … f(x_N| \theta) $

$f(x_1, x_2, …, x_n|\theta) = \prod_{i=2}^n f(X_i | \theta) =L(\theta)$

A likelihood function is the density function regarded as a function of \theta.

$ L(\theta |x) = \prod f(x|\theta) $, $\theta \in \Theta$

As we know the sample, we need to ‘guess’ the parameter \theta to let the joint prob distribution have a maximum probability.

The Maximum Likelihood Estimator (MLE) is :

$ \hat{\theta}(x) = arg\max_{\theta} L(\theta|x) $

To simplify the calculation, we normally apply a logarithm transformation,

$$ l(\theta)= log\ L $$

$$log L(\theta |x) =log \prod f(x|\theta)= \sum_{i=1}^N log\ f_{\theta} (y_i) $$

As the logarithm transformation is monotonous,

$$ \hat{\theta}(x) = arg\max_{\theta} L(\theta|x) \Leftrightarrow $$

$$\hat{\theta}(x) = arg\max_{\theta} l(\theta|x) $$

To review it, we aim to estimate the parameters. The logic of the MLE estimator is that we guess a probability distribution, fitting into our data (sample). Then, we find or guess the value of parameters such that the joint probability function (likelihood function) achieves the maximum value.

In another word, we guess the probability dist and parameters to make sample data achieve highest possibility.

To find the argmax_{\theta} l(\theta) is equivalent to argmin_{\theta} -l(\theta)

A Normal Distribution Example

A normal distribution example,for random sample x_i, i=1, 2, 3, …, N

$$f_{\theta}(x) = \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}} $$

We substitute f_{\theta}(x) into the log-likelihood distribution, then we find,

$$ l(\mu, \sigma^2)=log\ L(\mu,\sigma^2) $$

$$= -\frac{n}{2}\big( log2\pi + log\sigma^2 \big) -\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i -\mu)^2 $$

Taking the First Order Conditions, F.O.C.

By setting \frac{\partial l}{\partial \mu}= 0 and \frac{\partial l}{\partial \sigma^2}= 0, we solve \mu_{MLE} and \sigma^2_{MLE}.

$$ \mu_{MLE} =\bar{x}$$

$$\sigma^2_{MLE} =\frac{1}{n} \sum_{i=1}^n (x_i – \bar{x})$$

P.S. We can prove that there are the local maximum, coz the second partial derivatives are negative.

def MLE_norm(x):
    mu_hat = np.mean(x)
    sigma2_hat = np.var(x)
    return mu_hat, sigma2_hat

#----------

mu = 5
sigma = 2.5
N = 10000

np.random.seed(0)
x = np.random.normal( mu, sigma, size=(N,) )

#----------

mu_hat, sigma2_hat = MLE_norm(x)

for_mu_hat = '\hat{\mu} = '+format(round(mu_hat,2))+''
for_sigma2_hat = '\hat{\sigma} = '+format(round(np.sqrt(sigma2_hat),2))+''
print('The MLE for data is:')
display(Latex(for_mu_hat))
display(Latex(for_sigma2_hat))

By our calculation, we know that mathematically the MLE estimator of mu and sigma are the mean and variance.

Perform MLE Numerically

If the log-likelihood function wasn’t continuous or differentiable. Can solve numerically through an optimisation problem where the objective function is …

In other words, sometimes we cannot calculate the MLE estimator by hand, because the log-likelihood function might not be differentiable. What can we do?

We apply the computer to interactively ‘guess’ values (Optimizer) that can end up with the maximum log-likelihood (or a minimum of negative log-likelihood).

The maximum likelihood estimator (MLE):

$$\hat{\theta}(x) = arg max_{\theta} L(\theta|x)$$

$$\hat{\theta}(x) = arg max_{\theta} L(\theta|x)$$

$$L(\theta) = \prod_{i=1}^N f(X_i | \theta) $$

$$ log\ f(X|\theta)=\sum_{i=1}^N log\ f(x_i|\theta)$$

$$ arg\max_{\theta} l(\theta) \Leftrightarrow arg\min_{\theta} -l(\theta) $$

Why we use the minimizer instead? There is only minimizer available in the scipy package, and those mini&max are basically equivalent.

def log_likelihood(theta, x):
    mu = theta[0]
    sigma = theta[1]
    l_theta = np.sum( np.log( sc.stats.norm.pdf(x, loc=mu, scale=sigma) ) )
    # there is a negative sign in front. We minimise it later.
    return -l_theta 

# Constraint Function that Restrict sigma to be positive.
def sigma_pos(theta):
    sigma = np.abs(theta[1])
    return sigma

#--------

cons_set = {'type':'ineq', 'fun': sigma_pos}

theta0 = [2,3] # inital guess. there could be any random numbers, coz it is just the intital value inputed into the optimiser
# find the minimum value of the log likelihood function
opt = sc.optimize.minimize(fun=log_likelihood, x0=theta0, args=(x,), constraints=cons_set)

for_mu_hat = '\hat{\mu} = '+format(round(opt.x[0],2))+''
for_sigma2_hat = '\hat{\sigma} = '+format(round(opt.x[1],2))+''

print('The MLE for data is:')
display(Latex(for_mu_hat))
display(Latex(for_sigma2_hat))