Math Tools – Fanyu Zhao

Least Squares Method – Intro to Kalman Filter

Consider a Linear Equation,

$$ y_i = \sum_{j=1}^n C_{i,j} x_j +v_i,\quad i=1,2,…$$

, where $C_{i,j}$ are scalars and $v_i\in \mathbb{R}$ is the measurement noise. The noise is unknown, while we assume it follows certain patterns (the assumptions are due to some statistical properties of the noise). We assume $v_i, v_j$ are independent for $i\neq j$ . Properties are mean of zero, and variance equals sigma squared.

$$\mathbb{E}(v_i)=0$$

$$\mathbb{E}(v_i^2) = \sigma_i^2$$

We can rewrite $y_i = \sum_{j=1}^n C_{i,j} x_j +v_i$ as,

$$ \begin{pmatrix} y_1 \ y_2 \ \vdots\ y_s\end{pmatrix} = \begin{pmatrix} C_{11} & C_{12} & \cdots & C_{1n} \ C_{21} & C_{22}& \cdots & C_{2n} \ \vdots & \vdots & \cdots & \vdots \ C_{s1} & C_{s2} & \cdots & C_{sn}\end{pmatrix} \begin{pmatrix} x_1 \ x_2 \ \vdots\ x_n\end{pmatrix} + \begin{pmatrix} v_1 \ v_2 \ \vdots\ v_s\end{pmatrix} $$

, in a matrix form,

$$ \vec{y} = C \vec{x} + \vec{v} $$

, but I would write in a short form,

$$ y= C x +v$$

We solve for the least squared estimator from the optimisation problem, (there is a squared L2 norm)

$$ \min_x || y-Cx ||_2^2 $$

Recursive Least Squared Method

The classic least squared estimator might not work well when data evolving. So, there emerges a Recursive Least Squared Method to deal with the discrete-time instance. Let’s say, for a discrete-time instance $k$ , $y_k \in \mathbb{R}’$ is within a set of measurements group follows,

$$y_k = C_k x + v_k$$

, where $C_k \in \mathbb{R}^{l\times n}$ , and $v_k \in \mathbb{R}^l$ is the measurement noise vector. We assume that the covariance of the measurement noise is given by,

$$ \mathbb{E}[v_k v_k^T] = R_k$$

, and

$$\mathbb{E}[v_k]=0$$

The recursive least squared method has the following form in this section,

$$\hat{x}k = \hat{x}{k-1} + K_k (y_k – C_k \hat{x}_{k-1})$$

, where $\hat{x}k$ and $\hat{x}{k-1}$ are the estimates of the vector $x$ at the discrete-time instants $k$ and $k-1$ , and $K_k \in \mathbb{R}^{n\times l}$ is the gain matrix that we need to determine. $K_k$ is coined the ‘Gain Matrix’

The above equation updates the estimate of $x$ at the time instant $k$ on the basis of the estimate $\hat{x}_{k-1}$ at the previous time instant $k-1$ and on the basis of the measurement $y_k$ obtained at the time instant $k$ , as well as on the basis of the gain matrix $K_k$ computed at the time instant $k$ .

Notation

$\hat{x}$ is the estimate.

$$ \hat{x}k = \begin{pmatrix} \hat{x}{1,k} \ \hat{x}{2,k} \ \vdots \\hat{x}{n,k} \end{pmatrix} $$

, which is corresponding with the true vector $x$ .

$$x = \begin{pmatrix} x_1 \ x_2 \ \vdots \ x_n \end{pmatrix}$$

The estimation error, $\epsilon_{i,k} = x_i – \hat{x}_{i,k} \quad i=1,2,…,n$ .

$$\epsilon_k = \begin{pmatrix} \epsilon_{1,k} \ \epsilon_{2,k} \ \vdots \\epsilon_{n,k} \end{pmatrix} = x – \hat{x}_k = \begin{pmatrix} x_1-\hat{x}_{1,k} \ x_2 – \hat{x}_{2,k} \ \vdots \x_n-\hat{x}_{n,k} \end{pmatrix} $$

The gain $K_k$ is computed by minimising the sum of variances of the estimation errors,

$$ W_k = \mathbb{E}(\epsilon_{1,k}^2) + \mathbb{E}(\epsilon_{2,k}^2) + \cdots + \mathbb{E}(\epsilon_{n,k}^2) $$

Next, let’s show the cost function could be represented as follows, ( $tr(.)$ is the trace of a matrix)

$$ W_k = tr(P_k) $$

, and $P_k$ is the estimation error covariance matrix defined by

$$ P_k = \mathbb{E}(\epsilon_k \epsilon_k^T )$$

Or, says,

$$ K_k = arg\min_{K_k} W_k = tr\bigg( \mathbb{E}(\epsilon_k \epsilon_k^T ) \bigg)$$

Why is that?

$$\epsilon_k \epsilon_k^T = \begin{pmatrix} \epsilon_{1,k} \ \epsilon_{2,k} \\vdots \ \epsilon_{n,k} \end{pmatrix} \begin{pmatrix} \epsilon_{1,k} & \epsilon_{2,k} & \cdots & \epsilon_{n,k} \end{pmatrix}$$

$$ = \begin{pmatrix} \epsilon_{1,k}^2 & \cdots & \epsilon_{1,k}\epsilon_{n,k} \ \vdots & \epsilon_{i,k}^2 & \vdots \ \epsilon_{1,k}\epsilon_{n,k} & \cdots & \epsilon_{n,k}^2\end{pmatrix} $$

So,

$$ P_k = \mathbb{E}[\epsilon_k \epsilon_k^T] $$

$$tr(P_k) = \mathbb{E}(\epsilon_{1,k}^2) + \mathbb{E}(\epsilon_{2,k}^2) + \cdots + \mathbb{E}(\epsilon_{n,k}^2)$$

Optimisation

$$ K_k = arg\min_{K_k} W_k = tr\bigg( \mathbb{E}(\epsilon_k \epsilon_k^T ) \bigg) = tr(P_k)$$

Let’s derive the optimisation problem.

$$\epsilon_k = x-\hat{x}_k$$

$$ =x-\hat{x}{k-1} – K_k(y_k – C_k \hat{x}{k-1}) $$

$$ = x- \hat{x}{k-1} – K_k (C_k x + v_k – C_k \hat{x}{k-1}) $$

$$ = (I – K_k C_k)(x-\hat{x}_{k-1}) – K_k v_k $$

$$ =(I-K_k C_k )\epsilon_{k-1} – K_k v_k $$

Recall $y_k = C_k x + v_k$ and $\hat{x}k = \hat{x}{k-1} + K_k (y_k – C_k \hat{x}_{k-1})$

So, $\epsilon_k \epsilon_k^T$ would be,

$$\epsilon_k \epsilon_k^T = \bigg((I-K_k C_k )\epsilon_{k-1} – K_k v_k\bigg)\bigg((I-K_k C_k )\epsilon_{k-1} – K_k v_k\bigg)^T$$

$P_k = \mathbb{E}(\epsilon_k \epsilon_k^T)$, and $P_{k-1} = \mathbb{E}(\epsilon_{k-1} \epsilon_{k-1}^T)$.

$\mathbb{E}(\epsilon_{k-1} v_k^T) = \mathbb{E}(\epsilon_{k-1}) \mathbb{E}(v_k^T) =0$ by the white noise property of $\epsilon$ and $v$. However, $\mathbb{E}(v_k v_k^T) = R_k$. Substituting all those into $P_k$, we would get,

$$P_k = (I – K_k C_k)P_{k-1}(I – K_k C_k)^T + K_k R_k K_k^T$$

$$ P_k = P_{k-1} – P_{k-1} C_k^T K_k^T – K_k C_k P_{k-1} + K_k C_k P_{k-1}C_k^T K_k^T + K_k R_k K_k^T $$

$$W = tr(P_k)= tr(P_{k-1}) – tr(P_{k-1} C_k^T K_k^T) – tr(K_k C_k P_{k-1}) + tr(K_k C_k P_{k-1}C_k^T K_k^T) + tr(K_k R_k K_k^T) $$

We take F.O.C. to solve for $K_k = arg\min_{K_k} W_k = tr\bigg( \mathbb{E}(\epsilon_k \epsilon_k^T ) \bigg) = tr(P_k)$ , by letting $\frac{\partial W_k}{\partial K_k} = 0$ . See the Matrix Cookbook and find how to do derivatives w.r.t. $K_k$ .

$$\frac{\partial W_k}{\partial K_k} = -2P_{k-1} C_k^T + 2K_k C_k P_{k-1} C_k^T + 2K_k R_k = 0$$

We solve for $K_k$ ,

$$ K_k = P_{k-1} C_k^T (R_k + C_k P_{k-1} C_k^T)^{-1}$$

, we let $L_k = R_k + C_k P_{k-1} C_k^T$ , and $L_k$ has the following property $L_k = L_k^T$ and $L_k^{-1} = (L_k^{-1})^T$

$$ K_k = P_{k-1} C_k^T L_k^{-1} $$

Plug $K_k = P_{k-1} C_k^T K_k^{-1}$ back into $P_k$ .

$$ P_k = P_{k-1} – K_kC_k P_{k-1} = (I-K_k C_k)P_{k-1} $$

Summary

In the end, the Recursive Least Squared Method could be summarised as the following three equations.

1. Update the Gain Matrix.

$$ K_k = P_{k-1} C_k^T (R_k + C_k P_{k-1} C_k^T)^{-1}$$

2. Update the Estimate.

$$\hat{x}_k = \hat{x}_{k-1} + K_k (y_k – C_k \hat{x}_{k-1})$$

3. Propagation of the estimation error covariance matrix by using this equation.

$(I-K_k C_k)P_{k-1}$

Reference

Introduction to Kalman Filter: Derivation of the Recursive Least Squares Method

Sigmoid & Logistic

Sigmoid function is largely used for the binary classification, in either machine learning algorithm or econometrics.

Why the Sigmoid Function shapes in this form?

Firstly, let’s introduce the odds.

Odds provide a measure of the likelihood of a particular outcome. They are calculated as the ratio of the number of outcomes that produce that outcome to the number that do not.

Odds also have a simple relation with probability: the odds of an outcome are the ratio of the probability that the outcome occurs to the probability that the outcome does not occur. In mathematical terms, p is the probability of the outcome, and 1-p is the probability of not occurring.

$$ odds = \frac{p}{1-p} $$

Odd and Probability

Let’s find some insights behind the probability and the odd. Probability links with the outcomes in that for each outcomes, the probability give its specific corresponding probability. $Pr(Y)$ , where $Y$ is the outcome, and $Pr(\cdot)$ is the probability density function that project outcomes to it’s prob.

What about the odds? Odds is more like a ratio that is calculated by the probability as the formula says.

Implication: Compared to the probability, odds provide more about how the binary classification is balanced or not, but the probability distribution.

Example

Rolling a six-side die. The probability of rolling 6 is $1/6$ , but the odd is $1/5.

Formula

$$ odd = \frac{Pr(Y)}{1-Pr(Y)} $$

, where $Y$ is the outcomes.

Logit

As the probability $Pr(Y)$ is always between $[0,1]$ , the odds must be non-negative, $odd \in [0,\infty]$ . We may want to apply a monotonic transformation to re-gauge that range of odds. We will apply on the logarithm.

$$ Sigmoid/Logistic := log(odds) =log\bigg( \frac{Pr(Y)}{1-Pr(Y)} \bigg) $$

We then get the Sigmoid function.

As the transformation we apply on is monotonic, the Sigmoid function remains the similar properties as the odd. The Sigmoid function keeps the similar implication, representing the balance of the binary outcomes.

Then, we bridge $Y = f(X)$ , the outcome $Y$ is a function of events $X$ . Here, we assume a linear form as $Y = X\beta$ . The sigmoid function would then become a function of $X$ .

$$g(X) = log\bigg( \frac{Pr(X\beta)}{1-Pr(X\beta)} \bigg) $$

$$ e^g = \frac{p}{1-p} $$

$$ p = \frac{e^g}{e^g+1}=\frac{1}{1+e^{-g}}$$

$$ p = \frac{1}{1+e^{-X\beta}}$$

We finally get out logistic sigmoid function as above.

Dirac Delta Function

The Dirac Delta Function could be applied to simplify the differential equation. There are three main properties of Dirac Delta Function.

$$\delta (x-x’) =\lim_{\tau\to0}\delta (x-x’)$$

such that,

$$ \delta (x-x’) = \begin{cases} \infty & x= x’ \ 0 & x\neq x’ \end{cases} $$

$$\int_{-\infty}^{\infty} \delta (x-x’)\ dx =1$$

Three Properties:

Property 1:

$$\delta(x-x’)=0 \quad \quad ,x\neq x’ $$

Property 2:

$$ \int_{x’-\epsilon}^{x’+\epsilon} \delta (x-x’)dx =1\quad \quad ,\epsilon >0 $$

Property 3:

$$\int_{x’-\epsilon}^{x’+\epsilon} f(x)\ \delta (x-x’)dx = f(x’)$$

At $x=x’$ the Dirac Delta function is sometimes thought of has having an “infinite” value. So, the Dirac Delta function is a function that is zero everywhere except one point and at that point it can be thought of as either undefined or as having an “infinite” value.

Taylor Series and Transition Density Functions

See my Github repo for full details.

https://github.com/eightsmile/cqf

1. Trinomial Random Walk

2. Transition Probability Density Function

The transition probability density function, $p(y,t;y’,t’)$ , is defined by,

$$ Prob(a<y'<b, at\ time \ t’ | y \ at \ time\ t) = \int_a^b p(yet;y’,t’)dy’$$

In words this is “the probability that the random variable y ′ lies between a and b at time t ′ in the future, given that it started out with value y at time t.”

Think of y and t as being current values with y ′ and t ′ being future values. The transition probability density function can be used to answer the question,

“What is the probability of the variable y ′ being in a speciﬁed range at time t ′ in the future given that it started out with value y at time t?”

Our Goal is to find the transition probability p.d.f., and so we find the relationship between $p(y,t;y’,t’)$ , and $p(y,t;y’,t’-\delta t)$ ,

3. From the Trinomial model to the Transition Probability Density function

The variable y can either rise, fall or take the same value after a time step δt. These movements have certain probabilities associated with them.

We are going to assume that the probability of a rise and a fall are both the same, $\alpha<\frac{1}{2}$ . (But, of course, this can be generalized. Why would we want to generalize this?)

3.1 The Forward Equation

Given ${y,t}$ , or says ${y,t}$ the current and previous. ${y’,t’}$ are variate in the future time.

The probability of being at $y’$ at time $t’$ is related to the probabilities of being at the previous three values and moving in the right direction:

$$ p(y,t;y’,t’) = \alpha \ p(y,t;y’+\delta y,t’-\delta t) + \ (1-2\alpha) \ p(y,t;y’,t’-\delta t) + \alpha \ p(y,t;y’-\delta y,t’-\delta t) $$

Given ${y,t}$ , we find relationship between ${y’,t’}$ and ${y’\pm \delta y,t’-\delta t}$ that is y’ and t’ a bit time previously.

Remember, our goal is to find a solution of $p(.)$ , we try to solve the above equation.

3.2 Taylor Series Expansion

We expand each term of the equation.

$$ p(y,t;y’,t’) = \alpha \ p(y,t;y’+\delta y,t’-\delta t) + \ (1-2\alpha) \ p(y,t;y’,t’-\delta t) + \alpha \ p(y,t;y’-\delta y,t’-\delta t) $$

Why we do that? Because there are too many variables in it, hard to solve it. We have to reduce the dimension.

$$ p(y,t;y’+\delta y,t’-\delta t)\approx \ p(y,t;y’,t) – \delta t \frac{\partial p}{\partial t’} +\delta y \frac{\partial p}{\partial y’} + \frac{1}{2}\delta y^2 \frac{\partial^2 p}{\partial y’^2} + O(\frac{\partial^2 p}{\partial t’^2}) $$

$$ p(y,t;y’,t’-\delta t)\approx \ p(y,t;y’,t) – \delta t \frac{\partial p}{\partial t’} + O(\frac{\partial^2 p}{\partial t’^2}) $$

$$ p(y,t;y’-\delta y,t’-\delta t)\approx \ p(y,t;y’,t) – \delta t \frac{\partial p}{\partial t’} -\delta y \frac{\partial p}{\partial y’} + \frac{1}{2}\delta y^2 \frac{\partial^2 p}{\partial y’^2} + O(\frac{\partial^2 p}{\partial t’^2}) $$

Plug them back into that equation, and after cancel out terms repeated we would left with,

$$ \frac{\partial p}{\partial t’} =\alpha \frac{\delta y^2}{\delta t} \frac{\partial^2 p}{\partial y’^2} + O(\frac{\partial^2 p}{\partial t’^2})$$

We drop those derivative terms with order greater and equal than $O(\frac{\partial^2 p}{\partial t’^2})$ .

$$ \frac{\partial p}{\partial t’} =\alpha \frac{\delta y^2}{\delta t} \frac{\partial^2 p}{\partial y’^2} $$

In the RHS, we focus on $\alpha \frac{\delta y^2}{\delta t}$ , firstly. The denominator and numerator have to be in the same order to make that term definite. Or, say $\delta y \sim O(\sqrt{\delta t})$ .

We thus let $c^2 = \alpha \frac{\delta y^2}{\delta t}$

$$ \frac{\partial p}{\partial t’} =c^2 \frac{\partial^2 p}{\partial y’^2} $$

The above equation is also named Fokker–Planck or forward Kolmogorov equation.

Now, we have a partial differential equation. Solve it, we can get the form of $p$ .

3.3 Backward Equation works similar.

$$ p(y,t;y’,t’) = \alpha \ p(y+\delta y,t+\delta t;y’,t’) + \ (1-2\alpha) \ p(y,t+\delta t;y’,t’) + \alpha \ p(y-\delta y,t+\delta t;y’,t’) $$

the dimension-reduced result is the blow, and it is called the backward Kolmogorov equation.

$$ \frac{\partial p}{\partial t’} + c^2 \frac{\partial^2 p}{\partial y’^2} =0 $$

4. Solve the Forward Kolmogorov Equation

We will solve for $p$ right now! However, we will solve it by assuming similarity solution.

$$ \frac{\partial p}{\partial t’} =c^2 \frac{\partial^2 p}{\partial y’^2} $$

This equation has an inﬁnite number of solutions. It has diﬀerent solutions for diﬀerent initial conditions and diﬀerent boundary conditions. We need only a special solution here. The detailed process of finding that solution is showing as the following,

4. 1 Assume a Solution Form

$$ p=t’^a f(\frac{y’}{t’^b}) = t’^a f(\xi)$$

$$ \xi = \frac{y’}{t’^b}$$

, where $a$ , and $b$ are indefinite variables.

Again, don’t ask why it is in this form, because it is a special solution!

4.2 Derivation

$$\frac{\partial p}{\partial y’}=t’^{a-b}\frac{df}{d\xi}$$

$$\frac{\partial^2 p}{\partial y’^2}=t’^{a-2b}\frac{d^2f}{d\xi^2}$$

$$\frac{\partial p}{\partial t’}=at’^{a-1}f(\xi)+by’t’^{a-b-1}\frac{df}{d\xi}$$

Substitue back into the forward Kolmogorov equation (remember $y’ = t’^b \xi$ ), we get,

$$ af(\xi) – b\xi \frac{df}{d\xi} = c^2 t’^{-2b+1} \frac{d^2f}{d\xi^2}$$

4.3 Choose $b$

As we need the RHS to be independent of $t’$ , we could choose the value of $b=\frac{1}{2}$ , to let the $t’$ has a power of 0. Why we do that? Because we aim to reduce the partial differential equation to be a ordinary differential equation, in which the only variable is $\xi$ , and $t’$ disappear.

By assuming the special form of $p$ , and letting $b= 1/2$ , our forward Kolmogorov becomes,

$$ af(\xi) – \frac{1}{2}\xi \frac{df}{d\xi} = c^2 \frac{d^2f}{d\xi^2}$$

$$ p=t’^a f(\frac{y’}{\sqrt{t’}}) = t’^a f(\xi)$$

$$ \xi = \frac{y’}{\sqrt{t’}}$$

4.4 Choose $a$

$$p=t’^a f(\frac{y’}{\sqrt{t’}}) $$

We know that $p$ is the transition p.d.f., its integral must be equal to ‘1’. $t’$ is independent by the definition of random walk behaviour, so we do only integrate $p$ , w.r.t. $y’$ .

$$\int_{\mathbb{R}}p\ dy’ = \int_{\mathbb{R}} t’^a f(\frac{y’}{\sqrt{t’}})\ dy’ = 1$$

$$ \int_{\mathbb{R}} t’^a f(\frac{y’}{\sqrt{t’}})\ dy’ = 1 $$

, by replace $x = \frac{y’}{\sqrt{t’}}$ ,

$$ \int_{\mathbb{R}} t’^{a+1/2} f(x)\ dx =t’^{a+1/2} \int_{\mathbb{R}} f(x)\ dx= 1 $$

$t’$ is independent, so the above equation would be equal to ‘1’ regardless the power of $t’$. Thus, $a = -\frac{1}{2}$ for sure.

Also, we get $\int_{\mathbb{R}} f(x)\ dx= 1$ .

4.5 Integrate! Solve it!

By assuming the special form of $p$ , and letting $a=-1/2$ , $b=1/2$ , we get,

$$ -\frac{1}{2}f(\xi) – \frac{1}{2}\xi \frac{df}{d\xi} = c^2 \frac{d^2f}{d\xi^2}$$

$$ p=\frac{1}{\sqrt{t’}} f(\frac{y’}{\sqrt{t’}}) = \frac{1}{\sqrt{t’}}f(\xi)$$

$$ \xi = \frac{y’}{\sqrt{t’}}$$

The forward Kolmogorov equation becomes,

$$ -\frac{1}{2}\bigg(f(\xi) – \xi \frac{df}{d\xi} \bigg)= c^2 \frac{d^2f}{d\xi^2}$$

$$ -\frac{1}{2}\bigg( \frac{d \xi f(\xi)}{d \xi} \bigg)= c^2 \frac{d^2f}{d\xi^2}$$

, as $f(\xi) – \xi \frac{df}{d\xi} = \frac{d \xi f(\xi)}{d \xi}$ .

Integrate 1st Time

$$ -\frac{1}{2}\xi f(\xi)= c^2 \frac{df}{d\xi} + constant$$

There’s an arbitrary constant of integration that could go in here but for the answer we want this is zero. We need only a special solution, so we can set that arbitrary constant term be zero.

So, the eq could be rewritten as,

$$ -\frac{1}{2c^2}\xi d\xi = \frac{1}{f(\xi)}df $$

Integrate 2nd Time

$$ ln\ f(\xi) = -\frac{\xi^2}{4c^2} + C$$

Take exponential, $f(\xi) = e^C e^{-\frac{\xi^2}{4c^2}} = A e^{-\frac{\xi^2}{4c^2}}$ .

Find $A$

The Last Step here is to find the exact value of $A$ . $A$ is chosen such that the integral of $f$ is one.

$$\int_{\mathbb{R}}f(\xi)\ d\xi =1$$

$$ \int_{\mathbb{R}}A e^{-\frac{\xi^2}{4c^2}} \ d\xi = 2cA\int_{\mathbb{R}} e^{-\frac{\xi^2}{4c^2}} \ d\big(\frac{\xi}{2c}\big) =1 $$

$$ 2cA \sqrt{\pi} = 1 $$

, so we get $A = \frac{1}{2c\sqrt{\pi}}$

Plug $f(\xi), a, b, A$ back into $p = t^a f(\xi)$ .

$$ p(y’)=\frac{1}{2c\sqrt{\pi \ t’}}e^{-\frac{\xi^2}{4c^2}} =\frac{1}{2c\sqrt{\pi \ t’}}e^{-\frac{y’^2}{4c^2t’}} $$

$p(.)$ now is normal like distributed.

$$N(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

So, we may say $\mu_{y’}=0$ , and $\sigma^2_{y’}=2c^2t’$ . Or, $y’ \sim N(0, 2c^2t’)$ .

5. Summary

$$p(y’)=\frac{1}{2c\sqrt{\pi \ t’}}e^{-\frac{\xi^2}{4c^2}} =\frac{1}{2c\sqrt{\pi \ t’}}e^{-\frac{y’^2}{4c^2t’}} $$

Finally, we solved the transition probability density function $p(.)$ . By assuming the forward or backward type of trinomial model, we find a partial differential relationship. Then, assuming a special form of $p(.)$ by similarity method, we solve it.

The meaning is that $p(y’)=\frac{1}{2c\sqrt{\pi \ t’}}e^{-\frac{\xi^2}{4c^2}} =\frac{1}{2c\sqrt{\pi \ t’}}e^{-\frac{y’^2}{4c^2t’}}$ is one of the transition probability density function that can satisfy the trinomial random walk.

Also, we find that $p(.)$ is normally liked distributed.

Duality

$$min\ f_0(x), x \in \mathbb{R}^n$$

$$s.t. f_i(x)\leq 0, \text{for i from 1 to m}$$

$$ \quad h_i(x)=0, \text{for i from 1 to q}$$

That is, in Lagrangian form,

$$L(x,\lambda,\gamma)=f_0((x)+\sum \lambda_i f_i (x) +\sum \gamma_i h_i(x)$$

$$ \min_{x} \max_{\lambda,\gamma} L(x,\lambda, \gamma) $$

$$s.t. \lambda \geq 0$$

The Duality Problem is,

$$g(\lambda, \gamma) = \min_{x} L(x,\lambda, \gamma)$$

$$\max_{\lambda, \gamma} g(\lambda, \gamma) $$

$$s.t. \nabla_x L(x,\lambda, \gamma)=0$$

$$\quad \lambda \geq 0$$

Why Duality?

We change the original problem in to the duality, which becomes a convexity optimisation problem.

Convexity Optimisation

The object function is convex. (or the negative of a concavity)
The feasible set is a convex set.

See further study.

KKT for Optimisation

Optimisation

Suppose $f, g_1, g_2, \dots, g_m$ are differentiable.

$$ \max_{x_1, x_2,…, x_n} f(x_1, x_2, …, x_n) $$

$$s.t. g_1(x_1, x_2, …, x_n)\geq 0 $$

$$\quad g_2(x_1, x_2, …, x_n)\geq 0 $$

$$\quad ……$$

$$\quad g_m(x_1, x_2, …, x_n)\geq 0 $$

There are $m$ constraints $g_i, i \in (1,m)$ , and one object function $f$ to maximum. $n$ variables.

KKT Conditions

Suppose the constraint qualification is satisfied. A local maximiser of the above problem $\bar{x} = (\bar{x_1}, \bar{x_2},… \bar{x_n})$ together with some $\bar{lambda_1}, \bar{lambda_2}, …, \bar{lambda_m} \geq 0$ satisfies the following:

$$ \frac{\partial f}{\partial x_i}(\bar{x}) + \lambda_1 \frac{\partial g_1}{\partial x_i}(\bar{x})+ \lambda_2 \frac{\partial g_2}{\partial x_i}(\bar{x}) +…+ \lambda_{m} \frac{\partial g_m}{\partial x_i}(\bar{x}) = 0$$

for $i=1,2,\dots,n$ , and

$$\bar{\lambda_k} g_k(\bar{x})=0$$

for $k=1,2,\dots,m$

We have state “Suppose the constraint qualification is satisfied” in the beginning. What is the qualification?

See step 4 in the next section.

Solve the Maximisation Problem

Step 1. Step up the Lagrangian
Step 2. Write down the KKT conditions and constraints.
Step 3. Solve the equation set. (P.S. Discuss the complementary slackness, and confirm the possible solutions)
Step 4. Check constraint qualification.

Step 4.

Constraint Set, $C=\{ (x_1, x_2, …, x_n)\in \mathbb{R}^n | g_k(x_1, x_2, …, x_m) \geq 0 \}$ for $k=1,2,\dots,m$ .

The constraint qualification is satisfied at $\bar{x} in C$ , if the constraints that hold at $\bar{x}$ with equality are independent. That is, if the $m$ vectors

$$ \nabla g_k(\bar{x})=\bigg( \frac{\partial g_k}{\partial x_1}(\bar{x}), \frac{\partial g_k}{\partial x_2}(\bar{x}),…,\frac{\partial g_k}{\partial x_n}(\bar{x}) \bigg)’ $$

for $k=1,2,\dots,m$ are independent.

Black and Scholes, 1973

See A bit Stochastic Calculus .

For,

$$ dS_t = \mu S_t \ dt +\sigma S_t \ dW_t $$

In calculating $d f(S_t)$ , we would get, (by Taylor Expansion)

$$ df(S_t) = \bigg( \frac{\partial f}{\partial t} + \frac{\partial f}{\partial S_t}\mu S_t +\frac{1}{2}\frac{\partial^2 f}{\partial S_t^2}\sigma^2 S_t^2 \bigg)dt + \frac{\partial f}{\partial S_t}\sigma S_t \ dW_t $$

A Special Form of $f(\cdot)$ is $f(S) = log(S)$ ,

$$ d\ log(S_t) = \bigg( \mu – \frac{1}{2}\sigma^2 \bigg)dt + \sigma \ dW_t $$

We get $Y_t = log S_t$ is the price of a derivative security with respect to $S_t$ and $t$ and then,

$$ dY_t= \bigg( \frac{\partial Y_t}{\partial t} + \frac{\partial Y_t}{\partial S_t}\mu S_t +\frac{1}{2}\frac{\partial^2 Y_t}{\partial S_t^2}\sigma^2 S_t^2 \bigg)dt + \frac{\partial Y_t}{\partial S_t}\sigma S_t \ dW_t $$

Consider a portfolio $\Pi$ is constructed with (1) short one derivative, and (2) long some fraction of stocks, $\Delta$ , such that the portfolio is risk natural. ( $\Delta = \frac{\partial Y}{\partial S}$ )

$$ \Pi_t = -Y +\Delta \ S_t $$

Differentiate it,

$$ d\Pi_t = -dY_t +\frac{dY}{dS}dS_t $$

Subtitute $dY_t$ and $dS_t$ into the above equation, we would then get the stochastic process of portfolio, by Ito’s Lemma.

$$ d\Pi_t =-\bigg( \frac{\partial Y}{\partial t} + \frac{1}{2}\frac{\partial^2 Y}{\partial S^2} \sigma^2 S_t^2 \bigg) dt $$

The diffusion term $dW_t$ disappears, and that means the portfolio is riskless during the interval $dt$ . Under a no arbitrage assumption, this portfolio can only earn the riskless return, $r$ .

$$ d\Pi_t =r\Pi_t \ dt $$

Subtitute $d\Pi_t$ and $\Pi_t$ into, we would get the Partial Differential Equation (PDE) / Black-Scholes equation,

$$ – \bigg( \frac{\partial Y}{\partial t} + \frac{1}{2}\frac{\partial^2 Y}{\partial S^2} \sigma^2 S_t^2 \bigg) dt = r\bigg(- Y_t + \frac{\partial Y}{\partial S}S_t \bigg)dt $$

$$ rY_t = \frac{\partial Y}{\partial t} + \frac{1}{2}\frac{\partial^2 Y}{\partial S^2} \sigma^2 S_t^2 + \frac{\partial Y}{\partial S}S_t $$

Then, we guess (where $U(.)$ is a function of $S_t$ at time t=T),

$$ Y_t = e^{-r(T-t)} U(S_T) $$

For a European call with strike price, K, $U(S_T)$ would be the payoff at maturity,

$$ U(S_T) = max( S – K , 0 ) $$

Finally, through a series of process to find a specific solution of the PDE, we can solve the value of call, ( $\Phi(\cdot)$ is the cumulative standard normal distribtion)

$$ c = S_t\Phi(d_1) – K e^{-r (T-t)}\Phi(d_2) $$

with,

$$ d_1 =\frac{ log(S_t/K) + (r-\frac{1}{2}\sigma^2)(T-t)}{\sigma \sqrt{T-t}} $$

$$ d_1 = \sigma \sqrt{T-t} $$

42-2010_Handbook Download

Maximum Likelihood Estimator

Joint Probability Density Function or Likelihood Function:

$f(x_1, x_2, …, x_n|\theta) = f(x_1|\theta) f(x_2|\theta) … f(x_N| \theta) $

$f(x_1, x_2, …, x_n|\theta) = \prod_{i=2}^n f(X_i | \theta) =L(\theta)$

A likelihood function is the density function regarded as a function of $\theta$ .

$ L(\theta |x) = \prod f(x|\theta) $, $\theta \in \Theta$

As we know the sample, we need to ‘guess’ the parameter $\theta$ to let the joint prob distribution have a maximum probability.

The Maximum Likelihood Estimator (MLE) is :

$ \hat{\theta}(x) = arg\max_{\theta} L(\theta|x) $

To simplify the calculation, we normally apply a logarithm transformation,

$$ l(\theta)= log\ L $$

$$log L(\theta |x) =log \prod f(x|\theta)= \sum_{i=1}^N log\ f_{\theta} (y_i) $$

As the logarithm transformation is monotonous,

$$ \hat{\theta}(x) = arg\max_{\theta} L(\theta|x) \Leftrightarrow $$

$$\hat{\theta}(x) = arg\max_{\theta} l(\theta|x) $$

To review it, we aim to estimate the parameters. The logic of the MLE estimator is that we guess a probability distribution, fitting into our data (sample). Then, we find or guess the value of parameters such that the joint probability function (likelihood function) achieves the maximum value.

In another word, we guess the probability dist and parameters to make sample data achieve highest possibility.

To find the $argmax_{\theta} l(\theta)$ is equivalent to $argmin_{\theta} -l(\theta)$

A Normal Distribution Example

A normal distribution example,for random sample $x_i$ , i=1, 2, 3, …, N

$$f_{\theta}(x) = \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}} $$

We substitute $f_{\theta}(x)$ into the log-likelihood distribution, then we find,

$$ l(\mu, \sigma^2)=log\ L(\mu,\sigma^2) $$

$$= -\frac{n}{2}\big( log2\pi + log\sigma^2 \big) -\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i -\mu)^2 $$

Taking the First Order Conditions, F.O.C.

By setting $\frac{\partial l}{\partial \mu}= 0$ and $\frac{\partial l}{\partial \sigma^2}= 0$ , we solve $\mu_{MLE}$ and $\sigma^2_{MLE}$ .

$$ \mu_{MLE} =\bar{x}$$

$$\sigma^2_{MLE} =\frac{1}{n} \sum_{i=1}^n (x_i – \bar{x})$$

P.S. We can prove that there are the local maximum, coz the second partial derivatives are negative.

def MLE_norm(x):
    mu_hat = np.mean(x)
    sigma2_hat = np.var(x)
    return mu_hat, sigma2_hat

#----------

mu = 5
sigma = 2.5
N = 10000

np.random.seed(0)
x = np.random.normal( mu, sigma, size=(N,) )

#----------

mu_hat, sigma2_hat = MLE_norm(x)

for_mu_hat = ' $\hat{\mu} = '+format(round(mu_hat,2))+'$ '
for_sigma2_hat = ' $\hat{\sigma} = '+format(round(np.sqrt(sigma2_hat),2))+'$ '
print('The MLE for data is:')
display(Latex(for_mu_hat))
display(Latex(for_sigma2_hat))

By our calculation, we know that mathematically the MLE estimator of mu and sigma are the mean and variance.

Perform MLE Numerically

If the log-likelihood function wasn’t continuous or differentiable. Can solve numerically through an optimisation problem where the objective function is …

In other words, sometimes we cannot calculate the MLE estimator by hand, because the log-likelihood function might not be differentiable. What can we do?

We apply the computer to interactively ‘guess’ values (Optimizer) that can end up with the maximum log-likelihood (or a minimum of negative log-likelihood).

The maximum likelihood estimator (MLE):

$$\hat{\theta}(x) = arg max_{\theta} L(\theta|x)$$

$$L(\theta) = \prod_{i=1}^N f(X_i | \theta) $$

$$ log\ f(X|\theta)=\sum_{i=1}^N log\ f(x_i|\theta)$$

$$ arg\max_{\theta} l(\theta) \Leftrightarrow arg\min_{\theta} -l(\theta) $$

Why we use the minimizer instead? There is only minimizer available in the scipy package, and those mini&max are basically equivalent.

def log_likelihood(theta, x):
    mu = theta[0]
    sigma = theta[1]
    l_theta = np.sum( np.log( sc.stats.norm.pdf(x, loc=mu, scale=sigma) ) )
    # there is a negative sign in front. We minimise it later.
    return -l_theta 

# Constraint Function that Restrict sigma to be positive.
def sigma_pos(theta):
    sigma = np.abs(theta[1])
    return sigma

#--------

cons_set = {'type':'ineq', 'fun': sigma_pos}

theta0 = [2,3] # inital guess. there could be any random numbers, coz it is just the intital value inputed into the optimiser
# find the minimum value of the log likelihood function
opt = sc.optimize.minimize(fun=log_likelihood, x0=theta0, args=(x,), constraints=cons_set)

for_mu_hat = ' $\hat{\mu} = '+format(round(opt.x[0],2))+'$ '
for_sigma2_hat = ' $\hat{\sigma} = '+format(round(opt.x[1],2))+'$ '

print('The MLE for data is:')
display(Latex(for_mu_hat))
display(Latex(for_sigma2_hat))

Maximum-Likelihood-Estimator Download

Risk-Neutral Pricing

Our basic logic is,

$$ \text{Option Value} =\text{Discounted Expectation of the Payoff} $$

However, there is not a perfect correlation between the Option and the Underlying Asset. That fact encourage us to do the risk-netural method for valuing an option.

Geometric Brownian Motion

Non-dividend paying stock, ( $S_0>0$ )

$$ dS_t = \mu S_t\ dt+\sigma S_t \ dW_t $$

Money market, or bank account, ( $B_0 = 1$ )

$$ dB_t = r B_t\ d_t $$

and,

$$ \{ W_t \}_{t\in [0,T]} : \mathbb{P} \sim Brownian \ Motion $$

$ \{S_t\} $ is not a martingals under $\mathbb{P}$ because $\mu S_t$ is not zero.

So, we apply a transformation to $S_t$ and to make it be a martingal (eliminate the drift term).

We here look at $\{ \frac{S_t}{B_t} \}$ ,

$$ d\big( \frac{S_t}{B_t} \big)=-r\frac{S_t}{B_t}dt +\frac{1}{B_t} dS_t +\frac{1}{2}dS_t (-r\frac{1}{B_t}dt) $$

$$ d\big( \frac{S_t}{B_t} \big)=-r\frac{S_t}{B_t}dt +\bigg( \frac{1}{B_t}-\frac{1}{2}r\frac{1}{B_t}dt \bigg)\bigg( \mu S_t\ dt +\sigma S_t \ dW_t \bigg) $$

The cross terms dt & dW would decay to zero quickly in the stochastic integration. We therefore would get,

$$ d\big( \frac{S_t}{B_t} \big) = \sigma ( \frac{S_t}{B_t} )\underbrace{\bigg( \frac{\mu – r}{\sigma} dt + dW_t \bigg)}_{d\tilde{W_t}} $$

The drift disappears, instead we get $d\big( \frac{S_t}{B_t} \big)d\tilde{W_t}$

$$ d\big( \frac{S_t}{B_t} \big)d\tilde{W_t} $$

$$ d\tilde{W_t} = \frac{\mu -r}{\sigma} dt +dW_t $$

by Girsanov’s Theorem that $\{a_t \}_{t\in [0,T]}$ be and adapted $\{ \mathcal{F} \}_{t\in [0,T]}$ Ito Process, so that $\mathbb{Q}$ is an equivalent measure on $\mathbb{P}$ such that $\tilde{W_t}$ is a Brownian Motion, $\tilde{W_t} = W_t +\int_0^t a_u du$ .

$$ dS_t = rS_t \ d_t +\sigma S_t \ d\tilde{W_t} $$

Implication

The implication is that,

We can construct a relicating portfolio $(S_t, B_t)$ with value process $\{ V_t \}_{t\in[0,T]}$ . Under $\mathbb{Q}$ the discounted value process exists that is a martingale $\{ e^{-rt}V_t \}$ .

$$ e^{-rt} C_t =\mathbb{E}_{\mathbb{Q}}[ e^{-rT} C_T | \mathcal{F}_t ] $$

$$ =\mathbb{E}_{\mathbb{Q}}[ e^{-rT} V_T | \mathcal{F}_t ] = e^{-rt} V_t $$

Risk Neutral Expectation Pricing Formula

We therefore get,

$$ \frac{C_t}{B_t} = \mathbb{E}_{\mathbb{Q}}\bigg[ \frac{C_T}{B_t} \bigg| \mathcal{F}_t \bigg] $$

$$ B_T=1,\quad and \quad B_t=exp\{ \int_0^t r_s ds \} $$

Tensorflow – Keras – beginner-Example

Tensorflow-Keras-beginner-example Download

Recursive Least Squared Method

Notation

Optimisation

Summary

Reference

Firstly, let’s introduce the odds.

Odd and Probability

Example

Formula

Logit

1. Trinomial Random Walk

2. Transition Probability Density Function

3. From the Trinomial model to the Transition Probability Density function

3.1 The Forward Equation

3.2 Taylor Series Expansion

3.3 Backward Equation works similar.

4. Solve the Forward Kolmogorov Equation

4. 1 Assume a Solution Form

4.2 Derivation

4.3 Choose b

4.4 Choose a

4.5 Integrate! Solve it!

Integrate 1st Time

Integrate 2nd Time

Find A

5. Summary

Why Duality?

Convexity Optimisation

Optimisation

KKT Conditions

Solve the Maximisation Problem

A Normal Distribution Example

Perform MLE Numerically

Geometric Brownian Motion

Implication

Risk Neutral Expectation Pricing Formula

4.3 Choose $b$

4.4 Choose $a$

Find $A$