Maximum Likelihood Estimator

Joint Probability Density Function or Likelihood Function:

$f(x_1, x_2, …, x_n|\theta) = f(x_1|\theta) f(x_2|\theta) … f(x_n| \theta) $

$f(x_1, x_2, …, x_n|\theta) = \prod_{i=1}^n f(x_i | \theta) = L(\theta)$

A likelihood function is the density function regarded as a function of $\theta$.

$ L(\theta |x) = \prod_{i=1}^n f(x_i|\theta) $, $\theta \in \Theta$

Given the observed sample, we 'guess' the parameter $\theta$ that makes the joint probability density of the data as large as possible.

The Maximum Likelihood Estimator (MLE) is :

$ \hat{\theta}(x) = arg\max_{\theta} L(\theta|x) $

To simplify the calculation, we normally apply a logarithm transformation,

$$ l(\theta)= log\ L(\theta) $$

$$log\ L(\theta |x) =log \prod_{i=1}^n f(x_i|\theta)= \sum_{i=1}^n log\ f(x_i|\theta) $$

As the logarithm is a strictly increasing (monotonic) transformation,

$$ \hat{\theta}(x) = arg\max_{\theta} L(\theta|x) \Leftrightarrow \hat{\theta}(x) = arg\max_{\theta} l(\theta|x) $$

To review: we aim to estimate the parameters. The logic of the MLE estimator is that we first assume a probability distribution that fits our data (sample), and then find or guess the parameter values at which the joint probability function (the likelihood function) attains its maximum.

In other words, we choose the probability distribution and parameters under which the observed sample is most likely.

Finding $arg\max_{\theta}\ l(\theta)$ is equivalent to finding $arg\min_{\theta} -l(\theta)$.

A Normal Distribution Example

For a normal distribution example, take a random sample $x_i$, $i=1, 2, 3, …, n$, with density

$$f_{\theta}(x) = \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}} $$

We substitute $f_{\theta}(x)$ into the log-likelihood function, then we find,

$$ l(\mu, \sigma^2)=log\ L(\mu,\sigma^2) $$

$$= -\frac{n}{2}\big( log\ 2\pi + log\ \sigma^2 \big) -\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i -\mu)^2 $$

Taking the First Order Conditions, F.O.C.

By setting $\frac{\partial l}{\partial \mu}= 0$ and $\frac{\partial l}{\partial \sigma^2}= 0$, we solve for $\mu_{MLE}$ and $\sigma^2_{MLE}$.

$$ \mu_{MLE} =\bar{x}$$

$$\sigma^2_{MLE} =\frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2$$

P.S. We can verify that this is a maximum, because the second-order partial derivatives are negative at the solution.

import numpy as np
from IPython.display import display, Latex

def MLE_norm(x):
    # closed-form MLE for a normal sample
    mu_hat = np.mean(x)
    sigma2_hat = np.var(x)  # np.var defaults to ddof=0, i.e. the 1/n MLE variance
    return mu_hat, sigma2_hat

#----------

mu = 5
sigma = 2.5
N = 10000

np.random.seed(0)
x = np.random.normal( mu, sigma, size=(N,) )

#----------

mu_hat, sigma2_hat = MLE_norm(x)

for_mu_hat = r'$\hat{\mu} = ' + format(round(mu_hat,2)) + '$'
for_sigma2_hat = r'$\hat{\sigma} = ' + format(round(np.sqrt(sigma2_hat),2)) + '$'
print('The MLE for data is:')
display(Latex(for_mu_hat))
display(Latex(for_sigma2_hat))

By our calculation, we know that mathematically the MLE estimators of $\mu$ and $\sigma^2$ are the sample mean and the sample variance (with the $1/n$ divisor).

Perform MLE Numerically

If the log-likelihood function is not continuous or differentiable, we can still solve for the MLE numerically, as an optimisation problem whose objective function is the negative log-likelihood.

In other words, sometimes we cannot derive the MLE estimator by hand, because the log-likelihood function might not be differentiable. What can we do?

We let the computer iteratively 'guess' values (an optimiser) that end up with the maximum log-likelihood (or a minimum of the negative log-likelihood).

The maximum likelihood estimator (MLE):

$$\hat{\theta}(x) = arg max_{\theta} L(\theta|x)$$

$$L(\theta) = \prod_{i=1}^n f(x_i | \theta) $$

$$ l(\theta) = log\ L(\theta|x)=\sum_{i=1}^n log\ f(x_i|\theta)$$

$$ arg\max_{\theta} l(\theta) \Leftrightarrow arg\min_{\theta} -l(\theta) $$

Why do we use a minimiser instead? SciPy's optimisation module only provides a minimiser, and minimisation and maximisation are equivalent up to a sign change.

import numpy as np
import scipy as sc
import scipy.stats      # makes sc.stats available
import scipy.optimize   # makes sc.optimize available

def log_likelihood(theta, x):
    mu = theta[0]
    sigma = theta[1]
    l_theta = np.sum( np.log( sc.stats.norm.pdf(x, loc=mu, scale=sigma) ) )
    # there is a negative sign in front. We minimise it later.
    return -l_theta

# Constraint function that restricts sigma to be positive.
# An 'ineq' constraint requires fun(theta) >= 0, so we return sigma itself
# (returning np.abs(sigma) would always be non-negative and never bind).
def sigma_pos(theta):
    return theta[1]

#--------

cons_set = {'type':'ineq', 'fun': sigma_pos}

theta0 = [2,3] # initial guess; any reasonable values work, as this is just the starting point passed into the optimiser
# minimise the negative log-likelihood
opt = sc.optimize.minimize(fun=log_likelihood, x0=theta0, args=(x,), constraints=cons_set)

for_mu_hat = r'$\hat{\mu} = ' + format(round(opt.x[0],2)) + '$'
for_sigma2_hat = r'$\hat{\sigma} = ' + format(round(opt.x[1],2)) + '$'

print('The MLE for data is:')
display(Latex(for_mu_hat))
display(Latex(for_sigma2_hat))

Risk-Neutral Pricing

Our basic logic is,

$$ \text{Option Value} =\text{Discounted Expectation of the Payoff} $$

However, there is not a perfect correlation between the option and the underlying asset. That fact encourages us to use the risk-neutral method for valuing an option.

Geometric Brownian Motion

  • Non-dividend paying stock, (S_0>0)

$$ dS_t = \mu S_t\ dt+\sigma S_t \ dW_t $$

  • Money market, or bank account, (B_0 = 1)

$$ dB_t = r B_t\ dt $$

and,

$$ \{ W_t \}_{t\in [0,T]} \text{ is a Brownian Motion under } \mathbb{P} $$

$ \{S_t\} $ is not a martingale under $\mathbb{P}$ because the drift term $\mu S_t\ dt$ is not zero.

So, we apply a transformation to $S_t$ to make it a martingale (i.e. eliminate the drift term).

We here look at $\{ \frac{S_t}{B_t} \}$. By the Ito product rule,

$$ d\big( \frac{S_t}{B_t} \big)=-r\frac{S_t}{B_t}dt +\frac{1}{B_t} dS_t + dS_t \big(-r\frac{1}{B_t}dt\big) $$

$$ d\big( \frac{S_t}{B_t} \big)=-r\frac{S_t}{B_t}dt +\bigg( \frac{1}{B_t}- r\frac{1}{B_t}dt \bigg)\bigg( \mu S_t\ dt +\sigma S_t \ dW_t \bigg) $$

The cross terms of order $dt \cdot dt$ and $dt \cdot dW_t$ vanish in the stochastic integration. We therefore get,

$$ d\big( \frac{S_t}{B_t} \big) = \sigma ( \frac{S_t}{B_t} )\underbrace{\bigg( \frac{\mu - r}{\sigma} dt + dW_t \bigg)}_{d\tilde{W}_t} $$

The drift disappears; the dynamics of $\frac{S_t}{B_t}$ are driven entirely by the new Brownian term $d\tilde{W}_t$,

$$ d\big( \frac{S_t}{B_t} \big) = \sigma \big( \frac{S_t}{B_t} \big) d\tilde{W}_t $$

$$ d\tilde{W}_t = \frac{\mu -r}{\sigma} dt +dW_t $$

By Girsanov's Theorem, if $\{a_t \}_{t\in [0,T]}$ is a process adapted to $\{ \mathcal{F}_t \}_{t\in [0,T]}$ (here $a_t = \frac{\mu - r}{\sigma}$, the market price of risk), then there exists a measure $\mathbb{Q}$ equivalent to $\mathbb{P}$ under which $\tilde{W}_t = W_t +\int_0^t a_u du$ is a Brownian Motion. Under $\mathbb{Q}$, the stock dynamics become

$$ dS_t = rS_t \ dt +\sigma S_t \ d\tilde{W}_t $$

Implication

The implication is that,

We can construct a replicating portfolio in $(S_t, B_t)$ with value process $\{ V_t \}_{t\in[0,T]}$. Under $\mathbb{Q}$, the discounted value process $\{ e^{-rt}V_t \}$ is a martingale. For a claim with payoff $C_T = V_T$ at maturity,

$$ e^{-rt} C_t =\mathbb{E}_{\mathbb{Q}}[ e^{-rT} C_T | \mathcal{F}_t ] $$

$$ =\mathbb{E}_{\mathbb{Q}}[ e^{-rT} V_T | \mathcal{F}_t ] = e^{-rt} V_t $$

Risk Neutral Expectation Pricing Formula

We therefore get,

$$ \frac{C_t}{B_t} = \mathbb{E}_{\mathbb{Q}}\bigg[ \frac{C_T}{B_T} \bigg| \mathcal{F}_t \bigg] $$

$$ B_0=1,\quad and \quad B_t=exp\bigg\{ \int_0^t r_s ds \bigg\} $$
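As an illustration, here is a minimal Monte Carlo sketch of the risk-neutral pricing formula for a European call: simulate $S_T$ under $\mathbb{Q}$, discount the average payoff at $r$, and compare with the Black-Scholes closed form. All parameter values below are hypothetical.

import numpy as np
from scipy.stats import norm

# Hypothetical parameters for illustration
S0, K, r, sigma, T = 100.0, 105.0, 0.03, 0.2, 1.0
n_paths = 1_000_000

rng = np.random.default_rng(0)
# Under Q: S_T = S0 * exp((r - sigma^2/2) T + sigma sqrt(T) Z), Z ~ N(0,1)
Z = rng.standard_normal(n_paths)
S_T = S0 * np.exp((r - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * Z)

# Risk-neutral pricing: C_0 = e^{-rT} E_Q[(S_T - K)^+]
C_mc = np.exp(-r * T) * np.mean(np.maximum(S_T - K, 0.0))

# Black-Scholes closed form for comparison
d1 = (np.log(S0 / K) + (r + 0.5 * sigma**2) * T) / (sigma * np.sqrt(T))
d2 = d1 - sigma * np.sqrt(T)
C_bs = S0 * norm.cdf(d1) - K * np.exp(-r * T) * norm.cdf(d2)

print(f'Monte Carlo: {C_mc:.4f} | Black-Scholes: {C_bs:.4f}')

The two values should agree to within Monte Carlo error.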

Simulating Geometric Brownian Motion (GBM)

Here is a realisation of the previous blog, A bit Stochastic Calculus.
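The full code is in that post; the sketch below shows what such a realisation could look like, simulating GBM paths from the exact solution $S_t = S_0\, exp\{(\mu - \frac{1}{2}\sigma^2)t + \sigma W_t\}$ derived later in this post. Parameter values are hypothetical.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical parameters
mu, sigma, S0, T = 0.1, 0.3, 100.0, 1.0
n_steps, n_paths = 252, 10
dt = T / n_steps

rng = np.random.default_rng(42)
# Brownian increments dW ~ N(0, dt), cumulated into W_t
dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
W = np.cumsum(dW, axis=1)
t = np.linspace(dt, T, n_steps)

# Exact solution: S_t = S0 * exp((mu - sigma^2/2) t + sigma W_t)
S = S0 * np.exp((mu - 0.5 * sigma**2) * t + sigma * W)

plt.plot(t, S.T)
plt.xlabel('t'); plt.ylabel('S_t'); plt.title('GBM sample paths')
plt.show()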

Implications

  • 1. The Stochastic Integral is different from the Ordinary Integral, and so is differentiation. The Stochastic Differential Equation, S.D.E., we normally use to mimic the movement of a stock is the following,

$$ dS=\mu S \ dt + \sigma S \ dW$$

where W is a Brownian Motion.

The Brownian Motion also has the following critical properties:

  1. Martingale: E(S_{t+k}|F_t)=S_t for k \geq 0.
  2. Quadratic Variation: \sum_{i=1}^n (S_i-S_{i-1})^2 \to t
  3. Normality: an increment S_{t_i} - S_{t_{i-1}} is normally distributed, with mean zero and variance t_i - t_{i-1}
  • 2. What does Ito’s Lemma tell us?

If the stock price, S_t, follows a stochastic process, then a financial contract F(S_t), as a function of the underlying stock price, follows another stochastic process.

  • 3. Taylor Expansion helps

See the notes on the relevant paragraph of Paul Wilmott's book.

  • 4. Different Forms of Stochastic Processes could be applied.

Example 1: dS=\mu \ dt + \sigma \ dX

Example 2: dS=\mu S \ dt +\sigma S \ dX

For Example 2, Ito's lemma gives dV = \frac{\partial V}{\partial t}dt+\frac{\partial V}{\partial S}dS+\frac{1}{2}\sigma^2 S^2 \frac{\partial^2 V}{\partial S^2}dt; this S.D.E. is the one called Geometric Brownian Motion (GBM). For the specific value function V=F(S)=ln(S), we get dF = (\mu -\frac{1}{2}\sigma^2)dt + \sigma \ dX. See the derivation in the attached notes.

Example 3: A mean-reverting random walk. dS=(v-\mu S)dt +\sigma dX

Example 4: Another mean-reverting r.w. dS=(v-\mu S)dt +\sigma S^{1/2}dX

The stochastic term is altered compared with Example 3: now, if S ever gets close to zero, the randomness decreases, as illustrated in the sketch below.
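As a sketch under hypothetical parameters, both mean-reverting SDEs can be discretised with a simple Euler-Maruyama step; note how the $S^{1/2}$ factor in Example 4 shrinks the noise as $S$ approaches zero (the value is clipped at zero so the square root stays defined).

import numpy as np

# Hypothetical parameters: dS = (v - mu*S) dt + (noise term)
v, mu_, sigma, S0, T = 0.5, 0.8, 0.3, 1.0, 10.0
n_steps = 10_000
dt = T / n_steps

rng = np.random.default_rng(1)
dX = rng.normal(0.0, np.sqrt(dt), size=n_steps)

S3 = np.empty(n_steps + 1); S3[0] = S0  # Example 3: sigma dX
S4 = np.empty(n_steps + 1); S4[0] = S0  # Example 4: sigma * sqrt(S) dX
for i in range(n_steps):
    S3[i + 1] = S3[i] + (v - mu_ * S3[i]) * dt + sigma * dX[i]
    S4[i + 1] = max(S4[i] + (v - mu_ * S4[i]) * dt
                    + sigma * np.sqrt(max(S4[i], 0.0)) * dX[i], 0.0)

# Both paths revert towards the long-run level v/mu
print('v/mu =', v / mu_, '| path tails:', S3[-1000:].mean(), S4[-1000:].mean())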

A bit Stochastic Calculus

See the HTML file for full details.

Key Takeaways

  • Property of \{W_t\}:
  1. $ W_t - W_s \sim N(0, t-s) $
  2. Increments over non-overlapping intervals, e.g. $(W_t - W_{t-1})$ and $(W_{t-i} - W_{t-i-1})$, are independent and hence uncorrelated. So, $\int_0^t dW_u = W_t$
  • For,

$$ dS_t = \mu S_t \ dt +\sigma S_t \ dW_t $$

Why is the Geometric Brownian Motion of $\{S_t\}$ designed in that form?

The answer might be,

$$ dS_t /S_t = \mu \ dt +\sigma \ dW_t $$

$$\int_0^t dS_u /S_u = \int_0^t \mu \ du + \int_0^t \sigma \ dW_u $$

$$ log(S_t/S_0) = \mu \ t +\sigma \ W_t $$

(Heuristically treating $dS_t/S_t$ as $d\,log(S_t)$; the Ito correction appears below.) Taking the first difference is similar to differentiation: $d(log(S_t)) \approx log(S_t/S_{t-1}) = log(1+r_t) \approx r_t$.

$$ r_t = \mu + \sigma \ \Delta W_t $$

The return, \{r_t\}, is equal to a mean, \mu, plus a stochastic term. That is a random walk.

  • We apply a transformation: if $S_t$ follows a Geometric Brownian Motion, then $f(S_t)$ follows another stochastic process.

Calculating $d f(S_t)$ by Taylor expansion (Ito's Lemma), we would get,

$$ df(S_t) = \bigg( \frac{\partial f}{\partial t} + \frac{\partial f}{\partial S_t}\mu S_t +\frac{1}{2}\frac{\partial^2 f}{\partial S_t^2}\sigma^2 S_t^2 \bigg)dt + \frac{\partial f}{\partial S_t}\sigma S_t \ dW_t $$

  • A Special Form of f(\cdot) is f(S) = log(S),

$$ d\ log(S_t) = \bigg( \mu - \frac{1}{2}\sigma^2 \bigg)dt + \sigma \ dW_t $$

Integrate the above equation from time 0 to time t, then we would get,

$$ log(S_t) = log(S_0) + (\mu – \frac{1}{2}\sigma^2)t + \sigma W_t $$

Brownian Motion to Normal Distribution

Codes are shown in the HTML file.
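A minimal stand-in for that code is sketched below: build many Brownian paths by summing i.i.d. $N(0, \Delta t)$ increments and check that the terminal value $W_T$ is distributed as $N(0, T)$.

import numpy as np
from scipy import stats

T, n_steps, n_paths = 1.0, 500, 20_000
dt = T / n_steps

rng = np.random.default_rng(7)
# W_T is the sum of n_steps i.i.d. N(0, dt) increments
W_T = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps)).sum(axis=1)

print('mean (should be ~0):', W_T.mean())
print('variance (should be ~T):', W_T.var())
print('K-S test against N(0, sqrt(T)), p-value:',
      stats.kstest(W_T, 'norm', args=(0, np.sqrt(T))).pvalue)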

Markov Property

The Markov Property states that the expected value of the random variable \(S_i\) conditional upon all of the past events only depends on the previous value \(S_{i-1}\).

Martingale Property

The implication of the Markov Property is that the best expectation of the future price, \(S_{t+k}\), is the current price, \(S_t\): the current price contains all the information up to now.

$$ \mathbb{E}(S_{t+k}|F_t)=S_t, \text{ for } k \geq 0 $$

That is called the Martingale Property.

Quadratic Variation

The Quadratic Variation is defined as,

$$ \sum_{i=1}^k (S_i – S_{i-1})^2 $$

In the coin toss case, each movement must be either “1” or “-1”, so \(S_i – S_{i-1} = \pm 1\). Thus, \((S_i – S_{i-1})^2 =1\) for all “i”. Therefore, the Quadratic Variation equals “k”.

A Realisation

Here below is a code realisation of a Brownian motion, from Chapter 5.6 of Paul Wilmott Introduces Quantitative Finance.
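The book's code is not reproduced here; the following is a minimal sketch in the same spirit: a coin-toss random walk whose quadratic variation equals the number of tosses \(k\), and which, once rescaled, approximates a Brownian motion.

import numpy as np

k = 10_000
rng = np.random.default_rng(5)
steps = rng.choice([-1, 1], size=k)  # each toss moves the walk by +1 or -1
S = np.cumsum(steps)

# Quadratic variation: every squared increment is (+-1)^2 = 1, so the sum is k
qv = np.sum(np.diff(np.concatenate(([0], S))) ** 2)
print('quadratic variation =', qv, '(equals k =', k, ')')

# The rescaled walk S_[kt] / sqrt(k) approximates Brownian motion on [0, 1]
W_approx = S / np.sqrt(k)
print('terminal value (approximately N(0, 1)):', W_approx[-1])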

ARCH and GARCH

Let’s begin with the ARCH model.

ARCH Model

The ARCH model was initially proposed by Engle (1982); ARCH stands for the Autoregressive Conditional Heteroskedasticity model.

We assume here \(u_t\) is the return.

$$ u_t=\frac{P_t-P_{t-1}}{P_{t-1}} $$

$$u_t\sim N(0,\sigma_t^2)$$

The data-generating process (DGP) has an AR form, as the name ARCH suggests: the volatility is autoregressively generated by the squared returns \(u^2_{t-i}\).

$$\sigma_t^2=\delta_0+\sum_{i=1}^{p} \delta_i u_{t-i}^2$$

, where \(p\) is the number of lags and the \(\delta_i\) are parameters. The DGP of this model shows that the volatility of the return is heteroskedastic, correlated with the squared lagged values of the return itself.

For example, an ARCH(1) model is like,

$$ \sigma_t^2=\delta_0+\delta_1 u^2_{t-1} $$
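Here is a minimal simulation sketch of this ARCH(1) DGP with hypothetical parameters (chosen so that \(\delta_0 > 0\) and \(\delta_1 < 1\), as the stationarity condition below requires):

import numpy as np

# Hypothetical ARCH(1) parameters
delta0, delta1 = 0.2, 0.5
T = 10_000

rng = np.random.default_rng(3)
u = np.zeros(T)
sigma2 = np.zeros(T)
sigma2[0] = delta0 / (1 - delta1)  # start at the unconditional variance
for t in range(1, T):
    sigma2[t] = delta0 + delta1 * u[t - 1] ** 2  # conditional variance
    u[t] = np.sqrt(sigma2[t]) * rng.standard_normal()  # u_t ~ N(0, sigma_t^2)

print('sample variance:', u.var(), '| delta0/(1 - delta1):', delta0 / (1 - delta1))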

  • Stationarity

Note that we need our time series to be stationary for better forecasting, so the unconditional variance is constant: \(Var(u_t) = Var(u_{t-1}) = \sigma^2\). Taking unconditional expectations of the ARCH(1) equation,

$$ Var(u_t)=\delta_0+\delta_1 Var(u_{t-1}) $$

$$ \sigma^2=\frac{\delta_0}{1-\delta_1} $$

As the variance has to be positive, we need \(\delta_0 > 0\) and \(0 \leq \delta_1 < 1\).

  • Estimation

For this time series data, the OLS assumptions are violated, because the series exhibits autoregressive conditional heteroskedasticity.

Instead, Maximum Likelihood Estimation (MLE) is a better estimation method, obtained by assuming a probability distribution for the variables.

MLE iterates to find the parameters \(\delta\) that maximise the likelihood function, as in the sketch below.
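Continuing the numerical-MLE pattern from earlier, here is a sketch of estimating \((\delta_0, \delta_1)\) by minimising the negative Gaussian log-likelihood; it reuses the simulated array u from the ARCH(1) sketch above, and the function name is my own.

import numpy as np
from scipy.optimize import minimize

def neg_loglik_arch1(theta, u):
    delta0, delta1 = theta
    sigma2 = delta0 + delta1 * u[:-1] ** 2  # conditional variance for t = 1, ..., T-1
    # Negative Gaussian log-likelihood of u_t given sigma_t^2
    return 0.5 * np.sum(np.log(2 * np.pi) + np.log(sigma2) + u[1:] ** 2 / sigma2)

# Bounds keep delta0 > 0 and 0 <= delta1 < 1
res = minimize(neg_loglik_arch1, x0=[0.1, 0.1], args=(u,),
               bounds=[(1e-6, None), (0.0, 0.999)])
print('MLE estimates (delta0, delta1):', res.x)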

GARCH Model

The 'G' in GARCH stands for 'generalised': the GARCH model adds a set of terms, \(\sum_j \gamma_j \sigma^2_{t-j}\). Thus, the DGP of the GARCH(p,q) model is as follows,

$$u_t\sim N(0,\sigma_t^2)$$

$$ \sigma_t^2=\delta_0 + \sum_{i=1}^{p} \delta_i u^2_{t-i} +\sum^q_{j=1} \gamma_j \sigma^2_{t-j} $$
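For comparison, a GARCH(1,1) simulation sketch under hypothetical parameters (stationarity requires \(\delta_1 + \gamma_1 < 1\)):

import numpy as np

# Hypothetical GARCH(1,1) parameters
delta0, delta1, gamma1 = 0.1, 0.1, 0.85
T = 10_000

rng = np.random.default_rng(4)
u = np.zeros(T)
sigma2 = np.full(T, delta0 / (1 - delta1 - gamma1))  # start at unconditional variance
for t in range(1, T):
    sigma2[t] = delta0 + delta1 * u[t - 1] ** 2 + gamma1 * sigma2[t - 1]
    u[t] = np.sqrt(sigma2[t]) * rng.standard_normal()

print('sample variance:', u.var(),
      '| delta0/(1 - delta1 - gamma1):', delta0 / (1 - delta1 - gamma1))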

ARMA-GARCH Model

That is a further application, in which the GARCH model is used to mimic the movement of the error terms of an ARMA model.

We initially assume an ARMA(p,q) model,

$$ y_t=\beta_0 +\sum^p_{i=1} \beta_i y_{t-i} + \sum^{q}_{j=1} \theta_j u_{t-j} +u_t$$

Then, we assume that the error term here follows a GARCH process, \(u_t \sim GARCH(m,n)\).

$$ u_t \sim N(0,\sigma_t^2)$$

$$ \sigma_t^2 = \delta_0 +\sum^m_{i=1} \delta_i u_{t-i}^2 +\sum_{j=1}^n \gamma_j \sigma_{t-j}^2 $$
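A sketch of this full ARMA(1,1)-GARCH(1,1) DGP, simulated jointly under hypothetical parameters:

import numpy as np

# Hypothetical ARMA(1,1) mean parameters and GARCH(1,1) variance parameters
beta0, beta1, theta1 = 0.05, 0.6, 0.3
delta0, delta1, gamma1 = 0.1, 0.1, 0.85
T = 10_000

rng = np.random.default_rng(6)
y = np.zeros(T)
u = np.zeros(T)
sigma2 = np.full(T, delta0 / (1 - delta1 - gamma1))
for t in range(1, T):
    # GARCH(1,1) error term
    sigma2[t] = delta0 + delta1 * u[t - 1] ** 2 + gamma1 * sigma2[t - 1]
    u[t] = np.sqrt(sigma2[t]) * rng.standard_normal()
    # ARMA(1,1) mean equation
    y[t] = beta0 + beta1 * y[t - 1] + theta1 * u[t - 1] + u[t]

print('mean of y:', y.mean(), '| beta0/(1 - beta1):', beta0 / (1 - beta1))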

Reference

Engle, R.F., 1982. Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica, 50(4), pp.987-1007.

Recent Economic Observations

The IMF blog post Policymakers Need Steady Hand as Storm Clouds Gather Over Global Economy, by Pierre-Olivier Gourinchas, illustrates recent facts and economic data. Here is a brief summary.

The world economic slowdown continues. The IMF has adjusted its growth forecast for China downward to 4.4%, due to the weakening property sector and continued lockdowns.

Inflation

Rapidly rising prices, especially of food and energy, are causing serious hardship for households.

Inflation in China does not seem that pronounced; the biggest problem there, instead, is the lag in economic growth. Fortunately, wages are not increasing that much in most of the world's economies, so the wage-price spiral that could arise from simultaneous spikes in prices and wages is under control.

The FOMC's schedule of interest rate increases is expected to slow down to 50bp in December.

The Chinese government is conducting loose fiscal policy, partly stimulating aggregate demand and increasing fiscal spending on infrastructure across the country through local government financing platforms. As the balance sheet expands, the government leverage ratio rises. The strong government is trying to raise individuals' confidence. However, monetary policy is constrained by the QT in the U.S.: the exchange rate would be badly damaged if the central bank carried out QE.

Energy Price

Food and energy prices have surged to high levels and, as the IMF expects, are likely to stay high.

OPEC is cutting crude oil supply by 2 million barrels per day. The IEA reported that the US, a focal point of global economic policy, released 15 million barrels of its strategic crude oil reserve; the US strategic reserve stands at about 401 million barrels.

The oil-price game between OPEC and the US probably traces back to before 2020: against the backdrop of the 2020 oil price crash, US shale-oil production began to struggle to stay profitable. This awaits further study.

Strong USD

The strength of the USD is a major challenge for many emerging markets.

The appropriate response in most emerging and developing countries is to calibrate monetary policy to maintain price stability, while letting exchange rates adjust, conserving valuable foreign exchange reserves for when financial conditions really worsen.

In the trade-off between economic growth and inflation, inflation seems to have become the priority target. The economy still has bottom support from rigid demand for essentials such as food, energy, and accommodation. However, inflation is tied to nominal terms and could become even worse.