Maximum Likelihood Estimator

Joint Probability Density Function or Likelihood Function:

$f(x_1, x_2, …, x_n|\theta) = f(x_1|\theta) f(x_2|\theta) … f(x_n| \theta) $

$f(x_1, x_2, …, x_n|\theta) = \prod_{i=1}^n f(x_i | \theta) = L(\theta)$

A likelihood function is the density function regarded as a function of $\theta$.

$ L(\theta |x) = \prod_{i=1}^n f(x_i|\theta) $, $\theta \in \Theta$

Given the observed sample, we 'guess' the parameter $\theta$ that makes the joint probability density of the data as large as possible.

The Maximum Likelihood Estimator (MLE) is :

$ \hat{\theta}(x) = arg\max_{\theta} L(\theta|x) $

To simplify the calculation, we normally apply a logarithm transformation,

$$ l(\theta)= log\ L(\theta) $$

$$log\ L(\theta |x) =log \prod_{i=1}^n f(x_i|\theta)= \sum_{i=1}^n log\ f(x_i|\theta) $$

As the logarithm is a strictly increasing (monotonic) transformation,

$$ \hat{\theta}(x) = arg\max_{\theta} L(\theta|x) \Leftrightarrow \hat{\theta}(x) = arg\max_{\theta} l(\theta|x) $$

To review: we aim to estimate the parameters. The logic of the MLE estimator is that we first assume a probability distribution that fits our data (sample), and then find or guess the parameter values at which the joint probability function (the likelihood function) attains its maximum.

In other words, we choose the probability distribution and parameters under which the observed sample is most likely.

Finding $arg\max_{\theta}\ l(\theta)$ is equivalent to finding $arg\min_{\theta} -l(\theta)$.

A Normal Distribution Example

For a normal distribution example, take a random sample $x_i$, $i=1, 2, 3, …, n$, with density

$$f_{\theta}(x) = \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}} $$

We substitute $f_{\theta}(x)$ into the log-likelihood function, then we find,

$$ l(\mu, \sigma^2)=log\ L(\mu,\sigma^2) $$

$$= -\frac{n}{2}\big( log\ 2\pi + log\ \sigma^2 \big) -\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i -\mu)^2 $$

Taking the First Order Conditions, F.O.C.

By setting $\frac{\partial l}{\partial \mu}= 0$ and $\frac{\partial l}{\partial \sigma^2}= 0$, we solve for $\mu_{MLE}$ and $\sigma^2_{MLE}$.

$$ \mu_{MLE} =\bar{x}$$

$$\sigma^2_{MLE} =\frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2$$

P.S. We can verify that this is a maximum, because the second-order partial derivatives are negative at the solution.

import numpy as np
from IPython.display import display, Latex

def MLE_norm(x):
    # closed-form MLE for a normal sample
    mu_hat = np.mean(x)
    sigma2_hat = np.var(x)  # np.var defaults to ddof=0, i.e. the 1/n MLE variance
    return mu_hat, sigma2_hat

#----------

mu = 5
sigma = 2.5
N = 10000

np.random.seed(0)
x = np.random.normal( mu, sigma, size=(N,) )

#----------

mu_hat, sigma2_hat = MLE_norm(x)

for_mu_hat = r'$\hat{\mu} = ' + format(round(mu_hat,2)) + '$'
for_sigma2_hat = r'$\hat{\sigma} = ' + format(round(np.sqrt(sigma2_hat),2)) + '$'
print('The MLE for data is:')
display(Latex(for_mu_hat))
display(Latex(for_sigma2_hat))

By our calculation, we know that mathematically the MLE estimators of $\mu$ and $\sigma^2$ are the sample mean and the sample variance (with the $1/n$ divisor).

Perform MLE Numerically

If the log-likelihood function is not continuous or differentiable, we can still solve for the MLE numerically, as an optimisation problem whose objective function is the negative log-likelihood.

In other words, sometimes we cannot derive the MLE estimator by hand, because the log-likelihood function might not be differentiable. What can we do?

We let the computer iteratively 'guess' values (an optimiser) that end up with the maximum log-likelihood (or a minimum of the negative log-likelihood).

The maximum likelihood estimator (MLE):

$$\hat{\theta}(x) = arg max_{\theta} L(\theta|x)$$

$$L(\theta) = \prod_{i=1}^n f(x_i | \theta) $$

$$ l(\theta) = log\ L(\theta|x)=\sum_{i=1}^n log\ f(x_i|\theta)$$

$$ arg\max_{\theta} l(\theta) \Leftrightarrow arg\min_{\theta} -l(\theta) $$

Why do we use a minimiser instead? SciPy's optimisation module only provides a minimiser, and minimisation and maximisation are equivalent up to a sign change.

import numpy as np
import scipy as sc
import scipy.stats      # makes sc.stats available
import scipy.optimize   # makes sc.optimize available

def log_likelihood(theta, x):
    mu = theta[0]
    sigma = theta[1]
    l_theta = np.sum( np.log( sc.stats.norm.pdf(x, loc=mu, scale=sigma) ) )
    # there is a negative sign in front. We minimise it later.
    return -l_theta

# Constraint function that restricts sigma to be positive.
# An 'ineq' constraint requires fun(theta) >= 0, so we return sigma itself
# (returning np.abs(sigma) would always be non-negative and never bind).
def sigma_pos(theta):
    return theta[1]

#--------

cons_set = {'type':'ineq', 'fun': sigma_pos}

theta0 = [2,3] # initial guess; any reasonable values work, as this is just the starting point passed into the optimiser
# minimise the negative log-likelihood
opt = sc.optimize.minimize(fun=log_likelihood, x0=theta0, args=(x,), constraints=cons_set)

for_mu_hat = r'$\hat{\mu} = ' + format(round(opt.x[0],2)) + '$'
for_sigma2_hat = r'$\hat{\sigma} = ' + format(round(opt.x[1],2)) + '$'

print('The MLE for data is:')
display(Latex(for_mu_hat))
display(Latex(for_sigma2_hat))

Risk-Neutral Pricing

Our basic logic is,

$$ \text{Option Value} =\text{Discounted Expectation of the Payoff} $$

However, there is not a perfect correlation between the option and the underlying asset. That fact encourages us to use the risk-neutral method for valuing an option.

Geometric Brownian Motion

  • Non-dividend paying stock, (S_0>0)

$$ dS_t = \mu S_t\ dt+\sigma S_t \ dW_t $$

  • Money market, or bank account, (B_0 = 1)

$$ dB_t = r B_t\ dt $$

and,

$$ \{ W_t \}_{t\in [0,T]} \text{ is a Brownian Motion under } \mathbb{P} $$

$ \{S_t\} $ is not a martingale under $\mathbb{P}$ because the drift term $\mu S_t\ dt$ is not zero.

So, we apply a transformation to $S_t$ to make it a martingale (i.e. eliminate the drift term).

We here look at $\{ \frac{S_t}{B_t} \}$. By the Ito product rule,

$$ d\big( \frac{S_t}{B_t} \big)=-r\frac{S_t}{B_t}dt +\frac{1}{B_t} dS_t + dS_t \big(-r\frac{1}{B_t}dt\big) $$

$$ d\big( \frac{S_t}{B_t} \big)=-r\frac{S_t}{B_t}dt +\bigg( \frac{1}{B_t}- r\frac{1}{B_t}dt \bigg)\bigg( \mu S_t\ dt +\sigma S_t \ dW_t \bigg) $$

The cross terms of order $dt \cdot dt$ and $dt \cdot dW_t$ vanish in the stochastic integration. We therefore get,

$$ d\big( \frac{S_t}{B_t} \big) = \sigma ( \frac{S_t}{B_t} )\underbrace{\bigg( \frac{\mu - r}{\sigma} dt + dW_t \bigg)}_{d\tilde{W}_t} $$

The drift disappears; the dynamics of $\frac{S_t}{B_t}$ are driven entirely by the new Brownian term $d\tilde{W}_t$,

$$ d\big( \frac{S_t}{B_t} \big) = \sigma \big( \frac{S_t}{B_t} \big) d\tilde{W}_t $$

$$ d\tilde{W}_t = \frac{\mu -r}{\sigma} dt +dW_t $$

By Girsanov's Theorem, if $\{a_t \}_{t\in [0,T]}$ is a process adapted to $\{ \mathcal{F}_t \}_{t\in [0,T]}$ (here $a_t = \frac{\mu - r}{\sigma}$, the market price of risk), then there exists a measure $\mathbb{Q}$ equivalent to $\mathbb{P}$ under which $\tilde{W}_t = W_t +\int_0^t a_u du$ is a Brownian Motion. Under $\mathbb{Q}$, the stock dynamics become

$$ dS_t = rS_t \ dt +\sigma S_t \ d\tilde{W}_t $$

Implication

The implication is that,

We can construct a replicating portfolio in $(S_t, B_t)$ with value process $\{ V_t \}_{t\in[0,T]}$. Under $\mathbb{Q}$, the discounted value process $\{ e^{-rt}V_t \}$ is a martingale. For a claim with payoff $C_T = V_T$ at maturity,

$$ e^{-rt} C_t =\mathbb{E}_{\mathbb{Q}}[ e^{-rT} C_T | \mathcal{F}_t ] $$

$$ =\mathbb{E}_{\mathbb{Q}}[ e^{-rT} V_T | \mathcal{F}_t ] = e^{-rt} V_t $$

Risk Neutral Expectation Pricing Formula

We therefore get,

$$ \frac{C_t}{B_t} = \mathbb{E}_{\mathbb{Q}}\bigg[ \frac{C_T}{B_T} \bigg| \mathcal{F}_t \bigg] $$

$$ B_0=1,\quad and \quad B_t=exp\bigg\{ \int_0^t r_s ds \bigg\} $$
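As an illustration, here is a minimal Monte Carlo sketch of the risk-neutral pricing formula for a European call: simulate $S_T$ under $\mathbb{Q}$, discount the average payoff at $r$, and compare with the Black-Scholes closed form. All parameter values below are hypothetical.

import numpy as np
from scipy.stats import norm

# Hypothetical parameters for illustration
S0, K, r, sigma, T = 100.0, 105.0, 0.03, 0.2, 1.0
n_paths = 1_000_000

rng = np.random.default_rng(0)
# Under Q: S_T = S0 * exp((r - sigma^2/2) T + sigma sqrt(T) Z), Z ~ N(0,1)
Z = rng.standard_normal(n_paths)
S_T = S0 * np.exp((r - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * Z)

# Risk-neutral pricing: C_0 = e^{-rT} E_Q[(S_T - K)^+]
C_mc = np.exp(-r * T) * np.mean(np.maximum(S_T - K, 0.0))

# Black-Scholes closed form for comparison
d1 = (np.log(S0 / K) + (r + 0.5 * sigma**2) * T) / (sigma * np.sqrt(T))
d2 = d1 - sigma * np.sqrt(T)
C_bs = S0 * norm.cdf(d1) - K * np.exp(-r * T) * norm.cdf(d2)

print(f'Monte Carlo: {C_mc:.4f} | Black-Scholes: {C_bs:.4f}')

The two values should agree to within Monte Carlo error.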

Simulating Geometric Brownian Motion (GBM)

Here is a realisation of the previous blog, A bit Stochastic Calculus.
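The full code is in that post; the sketch below shows what such a realisation could look like, simulating GBM paths from the exact solution $S_t = S_0\, exp\{(\mu - \frac{1}{2}\sigma^2)t + \sigma W_t\}$ derived later in this post. Parameter values are hypothetical.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical parameters
mu, sigma, S0, T = 0.1, 0.3, 100.0, 1.0
n_steps, n_paths = 252, 10
dt = T / n_steps

rng = np.random.default_rng(42)
# Brownian increments dW ~ N(0, dt), cumulated into W_t
dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
W = np.cumsum(dW, axis=1)
t = np.linspace(dt, T, n_steps)

# Exact solution: S_t = S0 * exp((mu - sigma^2/2) t + sigma W_t)
S = S0 * np.exp((mu - 0.5 * sigma**2) * t + sigma * W)

plt.plot(t, S.T)
plt.xlabel('t'); plt.ylabel('S_t'); plt.title('GBM sample paths')
plt.show()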

Implications

  • 1. The Stochastic Integral is different from the Ordinary Integral, and so is differentiation. The Stochastic Differential Equation, S.D.E., we normally use to mimic the movement of a stock is the following,

$$ dS=\mu S \ dt + \sigma S \ dW$$

where W is a Brownian Motion.

The Brownian Motion also has the following critical properties:

  1. Martingale: E(S_{t+k}|F_t)=S_t for k \geq 0.
  2. Quadratic Variation: \sum_{i=1}^n (S_i-S_{i-1})^2 \to t
  3. Normality: an increment S_{t_i} - S_{t_{i-1}} is normally distributed, with mean zero and variance t_i - t_{i-1}
  • 2. What does Ito’s Lemma tell us?

If the stock price, S_t, follows a stochastic process, then a financial contract F(S_t), as a function of the underlying stock price, follows another stochastic process.

  • 3. Taylor Expansion helps

See the notes on the relevant paragraph of Paul Wilmott's book.

  • 4. Different Forms of Stochastic Processes could be applied.

Example 1: dS=\mu \ dt + \sigma \ dX

Example 2: dS=\mu S \ dt +\sigma S \ dX

For Example 2, Ito's lemma gives dV = \frac{\partial V}{\partial t}dt+\frac{\partial V}{\partial S}dS+\frac{1}{2}\sigma^2 S^2 \frac{\partial^2 V}{\partial S^2}dt; this S.D.E. is the one called Geometric Brownian Motion (GBM). For the specific value function V=F(S)=ln(S), we get dF = (\mu -\frac{1}{2}\sigma^2)dt + \sigma \ dX. See the derivation in the attached notes.

Example 3: A mean-reverting random walk. dS=(v-\mu S)dt +\sigma dX

Example 4: Another mean-reverting r.w. dS=(v-\mu S)dt +\sigma S^{1/2}dX

The stochastic term is altered compared with Example 3: now, if S ever gets close to zero, the randomness decreases, as illustrated in the sketch below.
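As a sketch under hypothetical parameters, both mean-reverting SDEs can be discretised with a simple Euler-Maruyama step; note how the $S^{1/2}$ factor in Example 4 shrinks the noise as $S$ approaches zero (the value is clipped at zero so the square root stays defined).

import numpy as np

# Hypothetical parameters: dS = (v - mu*S) dt + (noise term)
v, mu_, sigma, S0, T = 0.5, 0.8, 0.3, 1.0, 10.0
n_steps = 10_000
dt = T / n_steps

rng = np.random.default_rng(1)
dX = rng.normal(0.0, np.sqrt(dt), size=n_steps)

S3 = np.empty(n_steps + 1); S3[0] = S0  # Example 3: sigma dX
S4 = np.empty(n_steps + 1); S4[0] = S0  # Example 4: sigma * sqrt(S) dX
for i in range(n_steps):
    S3[i + 1] = S3[i] + (v - mu_ * S3[i]) * dt + sigma * dX[i]
    S4[i + 1] = max(S4[i] + (v - mu_ * S4[i]) * dt
                    + sigma * np.sqrt(max(S4[i], 0.0)) * dX[i], 0.0)

# Both paths revert towards the long-run level v/mu
print('v/mu =', v / mu_, '| path tails:', S3[-1000:].mean(), S4[-1000:].mean())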

A bit Stochastic Calculus

See the HTML file for full details.

Key Takeaways

  • Property of \{W_t\}:
  1. $ W_t - W_s \sim N(0, t-s) $
  2. Increments over non-overlapping intervals, e.g. $(W_t - W_{t-1})$ and $(W_{t-i} - W_{t-i-1})$, are independent and hence uncorrelated. So, $\int_0^t dW_u = W_t$
  • For,

$$ dS_t = \mu S_t \ dt +\sigma S_t \ dW_t $$

Why is the Geometric Brownian Motion of $\{S_t\}$ designed in that form?

The answer might be,

$$ dS_t /S_t = \mu \ dt +\sigma \ dW_t $$

$$\int_0^t dS_u /S_u = \int_0^t \mu \ du + \int_0^t \sigma \ dW_u $$

$$ log(S_t/S_0) = \mu \ t +\sigma \ W_t $$

(Heuristically treating $dS_t/S_t$ as $d\,log(S_t)$; the Ito correction appears below.) Taking the first difference is similar to differentiation: $d(log(S_t)) \approx log(S_t/S_{t-1}) = log(1+r_t) \approx r_t$.

$$ r_t = \mu + \sigma \ \Delta W_t $$

The return, \{r_t\}, is equal to a mean, \mu, plus a stochastic term. That is a random walk.

  • We apply a transformation: if $S_t$ follows a Geometric Brownian Motion, then $f(S_t)$ follows another stochastic process.

Calculating $d f(S_t)$ by Taylor expansion (Ito's Lemma), we would get,

$$ df(S_t) = \bigg( \frac{\partial f}{\partial t} + \frac{\partial f}{\partial S_t}\mu S_t +\frac{1}{2}\frac{\partial^2 f}{\partial S_t^2}\sigma^2 S_t^2 \bigg)dt + \frac{\partial f}{\partial S_t}\sigma S_t \ dW_t $$

  • A Special Form of f(\cdot) is f(S) = log(S),

$$ d\ log(S_t) = \bigg( \mu - \frac{1}{2}\sigma^2 \bigg)dt + \sigma \ dW_t $$

Integrate the above equation from time 0 to time t, then we would get,

$$ log(S_t) = log(S_0) + (\mu – \frac{1}{2}\sigma^2)t + \sigma W_t $$

Brownian Motion to Normal Distribution

Codes are shown in the HTML file.
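A minimal stand-in for that code is sketched below: build many Brownian paths by summing i.i.d. $N(0, \Delta t)$ increments and check that the terminal value $W_T$ is distributed as $N(0, T)$.

import numpy as np
from scipy import stats

T, n_steps, n_paths = 1.0, 500, 20_000
dt = T / n_steps

rng = np.random.default_rng(7)
# W_T is the sum of n_steps i.i.d. N(0, dt) increments
W_T = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps)).sum(axis=1)

print('mean (should be ~0):', W_T.mean())
print('variance (should be ~T):', W_T.var())
print('K-S test against N(0, sqrt(T)), p-value:',
      stats.kstest(W_T, 'norm', args=(0, np.sqrt(T))).pvalue)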

Markov Property

The Markov Property states that the expected value of the random variable \(S_i\) conditional upon all of the past events only depends on the previous value \(S_{i-1}\).

Martingale Property

The implication of the Markov Property is that the best expectation of the future price, \(S_{t+k}\), is the current price, \(S_t\): the current price contains all the information up to now.

$$ \mathbb{E}(S_{t+k}|F_t)=S_t, \text{ for } k \geq 0 $$

That is called the Martingale Property.

Quadratic Variation

The Quadratic Variation is defined as,

$$ \sum_{i=1}^k (S_i – S_{i-1})^2 $$

In the coin toss case, each movement must be either “1” or “-1”, so \(S_i – S_{i-1} = \pm 1\). Thus, \((S_i – S_{i-1})^2 =1\) for all “i”. Therefore, the Quadratic Variation equals “k”.

A Realisation

Here below is a code realisation of a Brownian motion, from Chapter 5.6 of Paul Wilmott Introduces Quantitative Finance.
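The book's code is not reproduced here; the following is a minimal sketch in the same spirit: a coin-toss random walk whose quadratic variation equals the number of tosses \(k\), and which, once rescaled, approximates a Brownian motion.

import numpy as np

k = 10_000
rng = np.random.default_rng(5)
steps = rng.choice([-1, 1], size=k)  # each toss moves the walk by +1 or -1
S = np.cumsum(steps)

# Quadratic variation: every squared increment is (+-1)^2 = 1, so the sum is k
qv = np.sum(np.diff(np.concatenate(([0], S))) ** 2)
print('quadratic variation =', qv, '(equals k =', k, ')')

# The rescaled walk S_[kt] / sqrt(k) approximates Brownian motion on [0, 1]
W_approx = S / np.sqrt(k)
print('terminal value (approximately N(0, 1)):', W_approx[-1])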

ARCH and GARCH

Let’s begin with the ARCH model.

ARCH Model

The ARCH model was initially proposed by Engle (1982); ARCH stands for the Autoregressive Conditional Heteroskedasticity model.

We assume here \(u_t\) is the return.

$$ u_t=\frac{P_t-P_{t-1}}{P_{t-1}} $$

$$u_t\sim N(0,\sigma_t^2)$$

The data-generating process (DGP) has an AR form, as the name ARCH suggests: the volatility is autoregressively generated by the squared returns \(u^2_{t-i}\).

$$\sigma_t^2=\delta_0+\sum_{i=1}^{p} \delta_i u_{t-i}^2$$

, where \(p\) is the number of lags and the \(\delta_i\) are parameters. The DGP of this model shows that the volatility of the return is heteroskedastic, correlated with the squared lagged values of the return itself.

For example, an ARCH(1) model is like,

$$ \sigma_t^2=\delta_0+\delta_1 u^2_{t-1} $$
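Here is a minimal simulation sketch of this ARCH(1) DGP with hypothetical parameters (chosen so that \(\delta_0 > 0\) and \(\delta_1 < 1\), as the stationarity condition below requires):

import numpy as np

# Hypothetical ARCH(1) parameters
delta0, delta1 = 0.2, 0.5
T = 10_000

rng = np.random.default_rng(3)
u = np.zeros(T)
sigma2 = np.zeros(T)
sigma2[0] = delta0 / (1 - delta1)  # start at the unconditional variance
for t in range(1, T):
    sigma2[t] = delta0 + delta1 * u[t - 1] ** 2  # conditional variance
    u[t] = np.sqrt(sigma2[t]) * rng.standard_normal()  # u_t ~ N(0, sigma_t^2)

print('sample variance:', u.var(), '| delta0/(1 - delta1):', delta0 / (1 - delta1))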

  • Stationarity

Note that we need our time series to be stationary for better forecasting, so the unconditional variance is constant: \(Var(u_t) = Var(u_{t-1}) = \sigma^2\). Taking unconditional expectations of the ARCH(1) equation,

$$ Var(u_t)=\delta_0+\delta_1 Var(u_{t-1}) $$

$$ \sigma^2=\frac{\delta_0}{1-\delta_1} $$

As the variance has to be positive, we need \(\delta_0 > 0\) and \(0 \leq \delta_1 < 1\).

  • Estimation

For this time series data, the OLS assumptions are violated, because the series exhibits autoregressive conditional heteroskedasticity.

Instead, Maximum Likelihood Estimation (MLE) is a better estimation method, obtained by assuming a probability distribution for the variables.

MLE iterates to find the parameters \(\delta\) that maximise the likelihood function, as in the sketch below.
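Continuing the numerical-MLE pattern from earlier, here is a sketch of estimating \((\delta_0, \delta_1)\) by minimising the negative Gaussian log-likelihood; it reuses the simulated array u from the ARCH(1) sketch above, and the function name is my own.

import numpy as np
from scipy.optimize import minimize

def neg_loglik_arch1(theta, u):
    delta0, delta1 = theta
    sigma2 = delta0 + delta1 * u[:-1] ** 2  # conditional variance for t = 1, ..., T-1
    # Negative Gaussian log-likelihood of u_t given sigma_t^2
    return 0.5 * np.sum(np.log(2 * np.pi) + np.log(sigma2) + u[1:] ** 2 / sigma2)

# Bounds keep delta0 > 0 and 0 <= delta1 < 1
res = minimize(neg_loglik_arch1, x0=[0.1, 0.1], args=(u,),
               bounds=[(1e-6, None), (0.0, 0.999)])
print('MLE estimates (delta0, delta1):', res.x)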

GARCH Model

The 'G' in GARCH stands for 'generalised': the GARCH model adds a set of terms, \(\sum_j \gamma_j \sigma^2_{t-j}\). Thus, the DGP of the GARCH(p,q) model is as follows,

$$u_t\sim N(0,\sigma_t^2)$$

$$ \sigma_t^2=\delta_0 + \sum_{i=1}^{p} \delta_i u^2_{t-i} +\sum^q_{j=1} \gamma_j \sigma^2_{t-j} $$
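For comparison, a GARCH(1,1) simulation sketch under hypothetical parameters (stationarity requires \(\delta_1 + \gamma_1 < 1\)):

import numpy as np

# Hypothetical GARCH(1,1) parameters
delta0, delta1, gamma1 = 0.1, 0.1, 0.85
T = 10_000

rng = np.random.default_rng(4)
u = np.zeros(T)
sigma2 = np.full(T, delta0 / (1 - delta1 - gamma1))  # start at unconditional variance
for t in range(1, T):
    sigma2[t] = delta0 + delta1 * u[t - 1] ** 2 + gamma1 * sigma2[t - 1]
    u[t] = np.sqrt(sigma2[t]) * rng.standard_normal()

print('sample variance:', u.var(),
      '| delta0/(1 - delta1 - gamma1):', delta0 / (1 - delta1 - gamma1))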

ARMA-GARCH Model

That is a further application, in which the GARCH model is used to mimic the movement of the error terms of an ARMA model.

We initially assume an ARMA(p,q) model,

$$ y_t=\beta_0 +\sum^p_{i=1} \beta_i y_{t-i} + \sum^{q}_{j=1} \theta_j u_{t-j} +u_t$$

Then, we assume that the error term here follows a GARCH process, \(u_t \sim GARCH(m,n)\).

$$ u_t \sim N(0,\sigma_t^2)$$

$$ \sigma_t^2 = \delta_0 +\sum^m_{i=1} \delta_i u_{t-i}^2 +\sum_{j=1}^n \gamma_j \sigma_{t-j}^2 $$
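A sketch of this full ARMA(1,1)-GARCH(1,1) DGP, simulated jointly under hypothetical parameters:

import numpy as np

# Hypothetical ARMA(1,1) mean parameters and GARCH(1,1) variance parameters
beta0, beta1, theta1 = 0.05, 0.6, 0.3
delta0, delta1, gamma1 = 0.1, 0.1, 0.85
T = 10_000

rng = np.random.default_rng(6)
y = np.zeros(T)
u = np.zeros(T)
sigma2 = np.full(T, delta0 / (1 - delta1 - gamma1))
for t in range(1, T):
    # GARCH(1,1) error term
    sigma2[t] = delta0 + delta1 * u[t - 1] ** 2 + gamma1 * sigma2[t - 1]
    u[t] = np.sqrt(sigma2[t]) * rng.standard_normal()
    # ARMA(1,1) mean equation
    y[t] = beta0 + beta1 * y[t - 1] + theta1 * u[t - 1] + u[t]

print('mean of y:', y.mean(), '| beta0/(1 - beta1):', beta0 / (1 - beta1))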

Reference

Engle, R.F., 1982. Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica, 50(4), pp.987-1007.

Recent Economic Observations

The IMF blog post Policymakers Need Steady Hand as Storm Clouds Gather Over Global Economy, by Pierre-Olivier Gourinchas, illustrates recent facts and economic data. Here is a brief summary.

The world economic slowdown continues. The IMF has adjusted its growth forecast for China downward to 4.4%, due to the weakening property sector and continued lockdowns.

Inflation

Rapidly rising prices, especially of food and energy, are causing serious hardship for households.

Inflation in China does not seem that pronounced; the biggest problem there, instead, is the lag in economic growth. Fortunately, wages are not increasing that much in most of the world's economies, so the wage-price spiral that could arise from simultaneous spikes in prices and wages is under control.

The FOMC's schedule of interest rate increases is expected to slow down to 50bp in December.

The Chinese government is conducting loose fiscal policy, partly stimulating aggregate demand and increasing fiscal spending on infrastructure across the country through local government financing platforms. As the balance sheet expands, the government leverage ratio rises. The strong government is trying to raise individuals' confidence. However, monetary policy is constrained by the QT in the U.S.: the exchange rate would be badly damaged if the central bank carried out QE.

Energy Price

Food and energy prices have surged to high levels and, as the IMF expects, are likely to stay high.

OPEC is cutting crude oil supply by 2 million barrels per day. The IEA reported that the US, a focal point of global economic policy, released 15 million barrels of its strategic crude oil reserve; the US strategic reserve stands at about 401 million barrels.

The oil-price game between OPEC and the US probably traces back to before 2020: against the backdrop of the 2020 oil price crash, US shale-oil production began to struggle to stay profitable. This awaits further study.

Strong USD

The strength of the USD is a major challenge for many emerging markets.

The appropriate response in most emerging and developing countries is to calibrate monetary policy to maintain price stability, while letting exchange rates adjust, conserving valuable foreign exchange reserves for when financial conditions really worsen.

In the trade-off between economic growth and inflation, inflation seems to have become the priority target. The economy still has bottom support from rigid demand for essentials such as food, energy, and accommodation. However, inflation is tied to nominal terms and could become even worse.