Recent Economic Observations

Credibility Affects How the Market Reacts

Preliminary: Loose monetary and fiscal policy stimulates the economy in the short run and brings a short-run increase in demand. An increase in M2 results in an increase in CPI. However:

  • 1. In the long run, as the price level rises, output returns to Y^*.
  • 2. Inflation is nothing but a monetary phenomenon. Too much money chases too few goods and services, and then there is inflation. When money is over-issued and, in the medium term, the growth of goods and services is insufficient to match it, inflation emerges.
  • P.S. Whether this takes the medium term depends on the market's demand for money. Unrealistic Case 1: even in the long run, if people still expect future goods and services to be sufficient to match the extra money, the price level might not rise (the "Technological Growth" case). Unrealistic Case 2: in the short run, or even the very short run, the private sector could be extremely pessimistic, trusting no public sector's credibility, absorbing no money and spending none of it on goods and services, even with helicopter drops. In that case, printing money means creating inflation directly; printing money would bring inflation even in the very short run!

Based on the previous assumption, one factor that affects how the market reacts to QE is the credibility of the public sector.

Where does credibility come from? It comes from the stability of a state, people's expectations about future development, the demographic structure, the government's leverage ratio and implicit leverage ratio, and so on. Credibility matters because the public sector can run a Ponzi scheme and roll over its debts indefinitely.

After Inflation

We assume that the public sector is credible and would not default, and that its stability and social structures are still healthy. Then QE results in too much money chasing too few goods and services, and inflation emerges.

After the inflation, how do people react?

  • 1. Money pours into capital-heavy assets, such as real estate and equity shares, so asset prices rise to absorb the extra money. The advantage is that goods and services (necessities) might not appreciate a lot. The disadvantage is that rising asset prices are still not a healthy phenomenon, and systemic trouble accumulates because the bubble collapses one day. If the bubble collapses, the CB must reduce liquidity (money); we then move to case 2.
  • 2. QT. However, QT might be too dangerous, because higher financing costs would increase the burden on highly leveraged households and firms. Defaults and fire sales increase, and asset prices slip. The current financial markets are highly securitised and synthetic, so a decrease in core assets would result in a series of dangerous chain reactions. Potential systemic risks come true.
  • 3. Technological growth brings extra demand, or something else, to absorb the extra money. However, technological growth is ambiguous and highly unlikely to happen, at least in the current situation; it is only a potential solution.
  • If case 3 does not work, we have to fall back on case 2: withdraw liquidity (money) and kill inflation. Asset prices, such as housing prices and stock prices, evaporate. In this case, necessities' prices are less likely to be affected because they are rigid. Since housing is a large component of the CPI basket, reducing asset prices might be one way to reduce inflation. However, the government might choose to keep a relatively sustainable and stable recession by maintaining housing prices, because a house has a special function: living. So killing the equity market, in which stock prices are not so directly linked with household living, might be the best choice for the government.

The first case might be more effective because problems are left to future generations.

The Chinese Market

China is not in the same economic cycle as the U.S. market. There could be a reversal in the short term, and even in the medium term, because the pandemic policies were forced to stop. However, I do not have a positive expectation for the market in the long run, for two main reasons: 1. rising leverage in local governments; 2. declining credibility. These two factors interact and make the problem even more severe.

Gradient / Derivative in Python

By definition:

$$ \nabla f(x_1, x_2) = \bigg(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}\bigg) $$

$$ \frac{\partial f}{\partial x_1} = \lim_{h\rightarrow 0}\frac{f(x+h)-f(x)}{h} $$

Numerically, the central difference is more accurate than the forward difference, so the code below uses

$$ \frac{\partial f}{\partial x_1} \approx \frac{f(x+h)-f(x-h)}{2h} $$

  • Code:
import numpy as np

func_1 = lambda x: x**2 + 5
func_2 = lambda x: x[0]**2 + x[1]**3 + 1
x_lim = np.arange(-5, 5, 0.01)
input_val = np.array([2.0, 3.0])


class Differentiate:
    def __init__(self):
        self.h = 1e-5
        self.dx = None
        
    def d1_diff(self, f, x):
        fh1 = f(x+self.h)
        fh2 = f(x-self.h)
        self.dx = (fh1 - fh2)/(2*self.h)
        return self.dx
        
    def tangent(self, series, f, x_loc):
        """
        Return the tangent line of f at x_loc, for plotting.
        """
        y_loc = f(x_loc)
        self.d1_diff(f, x_loc)    # was hard-coded to func_1; use the passed f
        b = y_loc - self.dx * x_loc
        y_series = self.dx * series + b
        return y_series
    
    # for f(x1, x2, x3, ...): central difference along each coordinate
    def dn_diff(self, f, x):
        grad = np.zeros_like(x)
        for i in range(len(x)):
            temp_val = x[i]
            x[i] = temp_val + self.h
            fxh1 = f(x)
            x[i] = temp_val - self.h
            fxh2 = f(x)
            grad[i] = (fxh1 - fxh2) / (2*self.h)
            x[i] = temp_val          # restore the coordinate
        self.dx = grad
        return self.dx
    
    def gradient_descent(self, f, init_x, lr = 0.01, step_num = 1000):
        x = init_x   
        for i in range(step_num):
            self.dn_diff(f, x)
            x -= lr * self.dx
        return x
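A quick sanity check (using the func_1, func_2, and input_val defined above; func_3 is an extra convex example added here for the descent demo):

diff = Differentiate()

# d/dx (x^2 + 5) at x = 2 is 4
print(diff.d1_diff(func_1, 2.0))        # ~4.0

# gradient of x0^2 + x1^3 + 1 at (2, 3) is (2*x0, 3*x1^2) = (4, 27)
print(diff.dn_diff(func_2, input_val))  # ~[ 4. 27.]

# gradient descent on a convex bowl converges to the minimiser (0, 0)
func_3 = lambda x: x[0]**2 + x[1]**2
print(diff.gradient_descent(func_3, np.array([2.0, 3.0])))  # ~[0. 0.]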

Optimiser

In a neural network, we optimise (minimise) the loss by choosing the weight matrix W and bias b.

$$ \arg\min_{W, b} Loss $$

By taking the F.O.C., we get the gradient, \nabla. Then we update the weights by subtracting the learning rate times the gradient.

Here, I wrote some test algorithms to see how optimisers have evolved.

Preparation

Let’s make some preparations and notation clarifications.

Notation:

  • Y is the true value,
  • T is the estimate.

$$Y = XW + b + e$$

Let T = XW + b

$$ loss = e^T e $$

$$ loss = (Y-T)^T \cdot (Y-T) $$

$$ = (XW+b+e-T)^T \cdot (XW+b+e-T) $$

$$ \frac{\partial loss}{\partial W} = -2X^T (Y-T) = 2X^T (T-Y) $$

$$ \frac{\partial loss}{\partial b} = -2 \cdot \mathbf{1}^T (Y-T) = 2 \cdot \mathbf{1}^T (T-Y)$$

Just One More Thing We Consider Here!

We do not want the number of samples to affect the loss function's value, so we take the average instead of the summation.

$$ \frac{\partial loss}{\partial W} = \frac{2}{N}\, X^T (T-Y) \quad, \quad \frac{\partial loss}{\partial b} = \frac{2}{N}\, \mathbf{1}^T (T-Y) $$
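As a quick numerical sanity check of these two formulas (a sketch; the data X, Y and the starting values W, b below are made up for illustration), the analytic gradient can be compared against a central difference on the averaged loss:

import numpy as np

np.random.seed(0)
N = 100
X = np.random.randn(N, 2)
Y = X @ np.array([[1.5], [-2.0]]) + 0.3 + 0.1 * np.random.randn(N, 1)

W = np.zeros((2, 1))
b = 0.0
T = X @ W + b                           # the estimate

dW = (2 / N) * X.T @ (T - Y)            # analytic, as derived above
db = (2 / N) * np.sum(T - Y)

def avg_loss(W, b):
    e = Y - (X @ W + b)
    return (e.T @ e).item() / N         # averaged e'e

eps = 1e-5
W_p, W_m = W.copy(), W.copy()
W_p[0, 0] += eps
W_m[0, 0] -= eps
print(dW[0, 0], (avg_loss(W_p, b) - avg_loss(W_m, b)) / (2 * eps))    # should match
print(db, (avg_loss(W, b + eps) - avg_loss(W, b - eps)) / (2 * eps))  # should match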

Optimisers

Gradient Descent

$$ W_t = W_{t-1} - \eta \nabla_W $$

$$ b_t = b_{t-1} - \eta \nabla_b $$

, where \eta is the learning rate.

However, gradient descent might not be satisfactory, because gradients are sometimes unavailable or vanish. Also, the process of gradient descent depends heavily on the hyperparameter \eta. Some other optimisation methods have therefore been developed.

Different optimisers aim to find the optimal weights more quickly and more accurately. The following shows how they are designed.

Stochastic Gradient Descent – SGD

In SGD, the stochastic part is added so that the model does not train on the full dataset at every step, which also helps avoid overfitting.

Momentum SGD

A momentum term is added.

  • Apply Momentum to the Gradient, \nabla.

$$\text{Gradient Descent:} \quad w_t = w_{t-1} -\eta g_w$$

$$ \text{Momentum:}\quad v_t = \beta_1 v_{t-1} + (1-\beta_1)g_w $$

At the beginning of the iteration, v_0=0, so we amend it to \hat{v}_t:

$$ \hat{v}_t = \frac{v_t}{1-\beta_1^t} $$

, where \beta_1 assigns the weights between the previous value and the gradient.

Replace the gradient g_w by the amended momentum term \hat{v}_t:

$$ w_t = w_{t-1} -\eta \hat{v}_t $$

P.S. Why do we amend v_t by dividing by 1 - \beta_1^t?

Since v_t = \beta_1 v_{t-1} + (1-\beta_1)g_w, the sequence is like a geometrically decaying polynomial. The difference is that w and g_w keep updating each period. Let's assume there is no more updating; in other words, the model has converged and g_w is a constant. Then \{ v_t \}_{t=0}^T would be,

$$ v_0 = 0 $$

$$ v_1 = (1-\beta_1)g_w $$

$$ v_2 = \beta_1 (1-\beta_1)g_w + (1-\beta_1) g_w $$

$$ v_3 = \beta_1^2(1-\beta_1)g_w + \beta_1 (1-\beta_1)g_w + (1-\beta_1)g_w $$

$$ v_n = \beta_1^{n-1}(1-\beta_1)g_w + \dots + (1-\beta_1)g_w $$

$$ v_n = (1-\beta_1)g_w (1+\beta_1 + \dots +\beta_1^{n-1}) $$

$$ v_n = (1-\beta_1)g_w \frac{1 - \beta_1^n}{1-\beta_1} = g_w (1-\beta_1^n)$$

Clearly, to make v_n behave like g_w, we need to divide it by 1-\beta_1^n, because of the geometric form.

RMSprop

  • Apply a Transformation to the Gradient, \nabla.

$$\text{Gradient Descent:} \quad w_t = w_{t-1} -\eta g_w$$

$$ \text{RMS:}\quad m_t = \beta_2 m_{t-1} + (1-\beta_2)g^2_w $$

At the beginning of the iteration, m_0=0, so we amend it to \hat{m}_t (remember there is a power t here):

$$ \hat{m}_t = \frac{m_t}{1-\beta_2^t} $$

, where \beta_2 assigns the weights between the previous value and the squared gradient.

Add a square root of \hat{m}_t in the denominator:

$$ w_t = w_{t-1} -\eta \frac{g_w}{\sqrt{\hat{m}_t}} $$

Adam

Adam is nearly the most powerful optimiser so far. It is like a combination of Momentum and RMSprop: both \hat{v} and \hat{m} are used to adjust the speed of approaching the optimal points, W^* and b^*.

$$ w_t = w_{t-1} - \eta \cdot \frac{ \hat{v}_t }{\sqrt{\hat{m}_t}} $$

There are three main hyperparameters; the common defaults are \eta = 0.001, \beta_1 = 0.9, and \beta_2 = 0.999.

Code
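Following the book's optimiser interface, each optimiser exposes an update(params, grads) method. Below is a minimal sketch of the update rules above (my own rewrite assuming dict-of-arrays params, not the book's exact code); the eps terms guard against division by zero:

import numpy as np

class SGD:
    def __init__(self, lr=0.01):
        self.lr = lr

    def update(self, params, grads):
        for k in params:
            params[k] -= self.lr * grads[k]

class Momentum:
    def __init__(self, lr=0.01, beta1=0.9):
        self.lr, self.beta1 = lr, beta1
        self.v, self.t = {}, 0

    def update(self, params, grads):
        self.t += 1
        for k in params:
            v = self.v.get(k, np.zeros_like(params[k]))
            v = self.beta1 * v + (1 - self.beta1) * grads[k]   # momentum
            self.v[k] = v
            v_hat = v / (1 - self.beta1 ** self.t)             # bias correction
            params[k] -= self.lr * v_hat

class RMSprop:
    def __init__(self, lr=0.01, beta2=0.999, eps=1e-8):
        self.lr, self.beta2, self.eps = lr, beta2, eps
        self.m, self.t = {}, 0

    def update(self, params, grads):
        self.t += 1
        for k in params:
            m = self.m.get(k, np.zeros_like(params[k]))
            m = self.beta2 * m + (1 - self.beta2) * grads[k] ** 2  # squared grads
            self.m[k] = m
            m_hat = m / (1 - self.beta2 ** self.t)
            params[k] -= self.lr * grads[k] / (np.sqrt(m_hat) + self.eps)

class Adam:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.v, self.m, self.t = {}, {}, 0

    def update(self, params, grads):
        self.t += 1
        for k in params:
            v = self.v.get(k, np.zeros_like(params[k]))
            m = self.m.get(k, np.zeros_like(params[k]))
            v = self.beta1 * v + (1 - self.beta1) * grads[k]        # first moment
            m = self.beta2 * m + (1 - self.beta2) * grads[k] ** 2   # second moment
            self.v[k], self.m[k] = v, m
            v_hat = v / (1 - self.beta1 ** self.t)
            m_hat = m / (1 - self.beta2 ** self.t)
            params[k] -= self.lr * v_hat / (np.sqrt(m_hat) + self.eps)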

Thinking about ML

Learning from Deep Learning from Scratch. Let me share some thoughts and ideas.

  • 1. Neural networks: multi-layer affine transformations with activation functions work well for non-linear transformation. Inputs are boosted into higher dimensions.
  • 2. The combination of NNs and loss minimisation makes backward propagation applicable. Each little step of progress is the result of previous scholars' research. That lets me feel closely how technologies and theories are developed, and how well statistics and mathematics are applied.
  • 3. Theories are inherited from previous research. Every year and every month, researchers from different countries and different institutions, universities or high-tech firms, publish brilliant papers that are striking and astonishing for further development.
  • 4. The current development of CS and ML is truly top-notch and inspiring. Hardware (CPUs, GPUs, and the cloud) provides fast computation, and linear algebra works well on computers; algorithms are innovative and evolve rapidly. Hardware and software together improve the ability to predict.
  • 5. Although good at predicting, deep learning (a neural network) is still a black box, which means one cannot explain why a result is obtained. The bridge between input and output is the weights, and the weights are explained by backward propagation. However, is that pattern true? Is it a correlation or a causality?

Deep Learning from Scratch

1. Perceptron

1.1 AND, OR & NAND Gates

The AND and OR gates can be implemented by a single linear classifier (perceptron).

import numpy as np

def AND(x1, x2):
    x = np.array([x1, x2])
    w = np.array([0.5, 0.5])     # weights
    b = -0.7                     # bias: fires only if both inputs are 1
    temp = np.sum(w*x) + b
    if temp <= 0:
        return 0
    else:
        return 1

def NAND(x1, x2):
    x = np.array([x1, x2])
    w = np.array([-0.5, -0.5])   # negated AND weights
    b = 0.7
    temp = np.sum(w*x) + b
    if temp <= 0:
        return 0
    else:
        return 1

def Nand(x1, x2):  # same as above
    return int(not bool(AND(x1, x2)))

def OR(x1, x2):
    x = np.array([x1, x2])
    w = np.array([0.5, 0.5])
    b = -0.2                     # lower threshold: fires if any input is 1
    temp = np.sum(w*x) + b
    if temp <= 0:
        return 0
    else:
        return 1

1.2 XOR Gate – Apply More than Two Perceptrons to Achieve Non-Linear Classification

def XOR(x1, x2):
    s1 = NAND(x1, x2)
    s2 = OR(x1,x2)
    y = AND(s1, s2)
    return y
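A quick check of the truth table:

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), XOR(x1, x2))   # expect 0, 1, 1, 0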
  • It has been proved that any function can be represented by a combination of two layers of perceptrons with sigmoid as the activation function.

By applying multiple layers of perceptrons, we can simulate any non-linear transformation. An important point is that an activation function must be inserted between the layers. It can easily be shown that

$$ W_2 \big(W_1 X + b_1 \big) + b_2 = (W_2 W_1) X + (W_2 b_1 + b_2) \equiv W^* X + b^*$$

A multi-layer composition of linear transformations is still a linear transformation. So multi-layer perceptrons become useless without activation functions.

However, if an activation function, in other words a non-linear transformation, is added between layers, then the neural network can mimic any non-linear function.

  • Therefore, what deep learning does is apply multi-layer perceptrons to mimic the pattern of a certain thing. Some key points are
    • (1) multi-layer perceptrons are called a neural network. Each pair of adjacent layers applies a linear transformation XW + b and then an activation function, such as Sigmoid(x) = \frac{1}{1+e^{-x}};
    • (2) update the weight matrix W and bias b until the neural network outputs an "optimal" result.
  • (1) How to define the layers and network structure, (2) how to evaluate whether the result is "optimal", and (3) how to find the weights are the main problems in deep learning.

2. Network Structure – Forward Propagation

Let's start with the first question: how to define the layers and network structure. Through the network,

  1. we input the data, X, initially.
  2. The data are transformed multiple times by (perceptron "Affine" layers and activation functions), which are the multi-layers.
  3. Then the results are passed through a Softmax function that reforms them into percentages.
  4. Finally, a loss is calculated.

$$ x \to f(.) \to a(.) \to \text{…} \to Softmax(.) \to Loss(.)$$

, where f(x) = xW + b

, where a(x) is the activation function.

, where Loss(x) is the loss function.
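A compact sketch of this pathway in NumPy (the layer sizes and variable names here are illustrative only, not the book's exact code):

import numpy as np

def affine(x, W, b):                  # f(.)
    return x @ W + b

def sigmoid(x):                       # a(.)
    return 1 / (1 + np.exp(-x))

def softmax(x):                       # Softmax(.), row-wise and overflow-safe
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(y, t):              # Loss(.)
    return -np.sum(t * np.log(y + 1e-7)) / y.shape[0]

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                       # batch of 4, 3 features
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 2)), np.zeros(2)
t = np.eye(2)[[0, 1, 0, 1]]                       # one-hot tags

# x -> f -> a -> f -> Softmax -> Loss
y = softmax(affine(sigmoid(affine(x, W1, b1)), W2, b2))
print(cross_entropy(y, t))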

2.1 Affine – Linear Transformation

$$ f(X) = X \cdot W +b $$

2.2 Activation Function

  • ReLU, f(x) = max(x, 0)
  • Sigmoid, f(x) = \frac{1}{1+e^{-x}}
  • tanh

Activation functions have some desirable properties, such as symmetry about the zero point, differentiability, etc. Those details are not discussed here.

2.3 Softmax

$$ \vec{X} = (x_1, \dots, x_i, \dots) $$

$$ Softmax(\vec{X}) = \vec{Y} = (…, \frac{e^{x_i}}{\sum e^{x_i}} ,…)^T $$

The inputs are reformed into percentages, and those percentages sum to one.

P.S.

$$ y_k = \frac{e^{a_k}}{\sum_{i=1}^{n}e^{a_i} } = \frac{C e^{a_k}}{C \sum_{i=1}^{n}e^{a_i} } $$

$$ = \frac{e^{a_k+ \ln C} }{\sum_{i=1}^{n}e^{a_i + \ln C} } $$

$$ = \frac{e^{a_k - C'} }{\sum_{i=1}^{n}e^{a_i - C'} } $$

To prevent overflow, subtract the maximum from x in advance. The maximum entry is the valid data; whether the other entries underflow is beyond our control, and we do not care.
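A minimal illustration of the trick (the naive version overflows; the shifted one is identical in exact arithmetic):

import numpy as np

a = np.array([1010.0, 1000.0, 990.0])

naive = np.exp(a) / np.sum(np.exp(a))                        # exp(1010) -> inf -> nan
shifted = np.exp(a - a.max()) / np.sum(np.exp(a - a.max()))  # subtract the max first

print(naive)    # [nan nan nan], with overflow warnings
print(shifted)  # roughly [1.0, 4.5e-05, 2.1e-09]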

2.4 Loss Function

  • Cross Entropy, L = -\sum t_i \ \ln(y_i) = -\ln(\vec{Y}) \cdot \mathbf{t}^T
  • MSE, see the "Linear Regression Paddle" note. E = \frac{1}{2} \sum_k (y_k - t_k)^2

3. Minimise Loss & Update Weights

We consider the "optimal" weights to be those that result in the minimum loss. In other words, we aim to find w and b that minimise the loss function.

How to Do It?

$$ \arg\min_{w, b} Loss $$

$$ \hat{w}: \frac{\partial L}{\partial w} = 0 ,\quad \hat{b}: \frac{\partial L}{\partial b} = 0$$

Remember, the pathway:

$$ x \to f(.) \to a(.) \to \text{…} \to Softmax(.) \to Loss(.)$$

To solve the F.O.C., we need to apply the chain rule backward, which is called backward propagation.

By Chain Rule,

$$ \frac{\partial Loss(w)}{\partial w} = \frac{\partial Loss(.)}{\partial Softmax} \cdot \frac{\partial Softmax(.)}{\partial a} \cdot \frac{\partial a}{\partial f} \cdot \frac{\partial f}{\partial a} \cdots \frac{\partial f}{\partial w} $$

Let’s Decompose each Part in the Following.

3.1 Loss

$$ \frac{\partial Loss(\vec{Y}, \vec{t})}{\partial \vec{Y}} $$

  • Cross Entropy

$$ L = -\sum t_i \ \ln(y_i) = -\ln(\vec{Y}) \cdot \mathbf{t}^T$$

$$\frac{\partial L}{\partial y_i} = – \frac{t_i}{y_i}$$

$$\frac{\partial L}{\partial \vec{Y}} = – \big(…,\frac{t_i}{y_i},…\big)^T$$

3.2 Softmax

$$ \vec{X} = (x_1, \dots, x_i, \dots) $$

$$ Softmax(\vec{X}) = \vec{Y} = (…, \frac{e^{x_i}}{\sum e^{x_i}} ,…)^T $$

$$ \frac{\partial Softmax(\vec{X})}{\partial \vec{X}} = Diag(\vec{Y}) - Y\cdot Y^T $$


The derivation is as follows,

https://blog.csdn.net/Wild_Young/article/details/121912675

$\frac{\partial Y}{\partial X}$:

Let X be a vector with shape = (n,1), and
$Y = Softmax(X)$.

That means, for each element y_k of Y,

$$ y_k = \frac{e^{x_k}}{\sum_i^N e^{x_i}} $$

So, consider

$$ \frac{\partial y_j}{\partial x_i} $$

If j = i,

$$ \frac{\partial y_j}{\partial x_i} = \frac{\partial }{\partial x_i} \Bigg( \frac{e^{x_i}}{\sum e^{x_i}} \Bigg) $$

$$ = \frac{ (e^{x_i})’\sum e^{x_i} – e^{x_i} (\sum e^{x_i})’ }{\big( \sum e^{x_i} \big)^2} $$

$$ =\frac{e^{x_i}}{\sum e^{x_i}} – \frac{e^{x_i}}{\sum e^{x_i}} \cdot \frac{e^{x_i}}{\sum e^{x_i}}$$

$$= y_i – y_i^2 =y_i(1-y_i)$$

If j \neq i,

$$ \frac{\partial y_j}{\partial x_i} = \frac{\partial }{\partial x_i} \Bigg( \frac{e^{x_j}}{\sum e^{x_i}} \Bigg) $$

$$ = \frac{-e^{x_i} e^{x_j}}{(\sum e^{x_i})^2} $$

$$ =- \frac{e^{x_j}}{\sum e^{x_i}} \cdot \frac{e^{x_i}}{\sum e^{x_i}} $$

$$ = -y_j y_i $$

$$ \frac{\partial y_j}{\partial x_i} = y_i - y_i^2 \quad \text{, if $i = j$} $$

$$ \frac{\partial y_j}{\partial x_i} = -y_j y_i \quad \text{, if $i \neq j$} $$

Therefore, for \frac{\partial Y}{\partial X}, we get the Jacobian matrix,

$$\frac{\partial \vec{Y}}{\partial \vec{X}} = Diag(Y) - Y Y^T$$

In particular, the diagonal of this Jacobian, the case i = j, is

$$ \frac{\partial y_i}{\partial x_i} = y_i-y_i^2 =y_i(1-y_i)$$

3.3 Loss (Cross-Entropy) and Softmax

There is a trick with the combination between cross-entropy-error and softmax.

The Cross-Entropy Loss function is

$$ L = -\sum_i t_i \ln(y_i) = -\ln(\vec{Y}) \cdot \mathbf{t}^T $$

, where \ln(\vec{Y}) applies log to each element of \vec{Y}, and \mathbf{t}^T is the transposed tag vector.

By matrix calculus (see https://www.matrixcalculus.org),

$$\frac{\partial L}{\partial \vec{Y}} = -\vec{\mathbf{t}}^T \cdot Diag(Y_{1/Y}) $$

, where Y_{1/Y} is 1/y_i for each element of Y.

$$ \frac{\partial L}{\partial x_k } = \sum_i \frac{\partial L}{\partial y_i}\cdot \frac{\partial y_i}{\partial x_k} $$

$$\frac{\partial L}{\partial x_k } = -\sum_i \frac{\partial }{\partial y_i} \bigg( t_i \ln(y_i) \bigg) \cdot \frac{\partial y_i}{\partial x_k} $$

$$\frac{\partial L}{\partial x_k } = -\sum_i \frac{t_i }{y_i} \cdot \frac{\partial y_i}{\partial x_k} $$

Plug in the derivatives of Softmax, \frac{\partial y}{\partial x}

$$\frac{\partial L}{\partial x_k } = -\frac{t_k }{y_k} \cdot \frac{\partial y_k}{\partial x_k} - \sum_{i\neq k} \frac{t_i }{y_i} \cdot \frac{\partial y_i}{\partial x_k} $$

$$ = -\frac{t_k }{y_k} \cdot y_k (1-y_k) - \sum_{i\neq k} \frac{t_i }{y_i} \cdot (-y_i y_k) $$

$$ = -t_k (1-y_k) + \sum_{i\neq k} t_i y_k $$

$$ = -t_k + t_k y_k + y_k \sum_{i\neq k} t_i $$

$$ = -t_k + y_k \sum_{i} t_i = y_k - t_k \quad\text{, because the tags $t$ sum to 1} $$

$$ \frac{\partial L}{\partial x_k } = y_k - t_k$$

So, we get \frac{\partial L}{\partial x_k} = y_k - t_k.
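This is why softmax and cross-entropy are usually fused into a single layer whose backward pass is simply y - t (divided by the batch size, to match the averaged loss). A minimal sketch:

import numpy as np

class SoftmaxWithLoss:
    def __init__(self):
        self.y, self.t = None, None

    def forward(self, x, t):
        x = x - x.max(axis=1, keepdims=True)    # overflow trick from 2.3
        self.y = np.exp(x) / np.exp(x).sum(axis=1, keepdims=True)
        self.t = t
        return -np.sum(t * np.log(self.y + 1e-7)) / x.shape[0]

    def backward(self):
        # dL/dx = (y - t) / batch_size, as derived above
        return (self.y - self.t) / self.t.shape[0]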

3.4 Affine

$$ \frac{\partial f}{\partial w} $$

$$ \frac{\partial f}{\partial b} $$

  • Backward (writing G = \frac{\partial L}{\partial f} for the upstream gradient)

$$ \frac{\partial L}{\partial X} = G \cdot W^T \quad \text{, right-multiplied by $W^T$}$$

$$ \frac{\partial L}{\partial W} = X^T \cdot G \quad \text{, left-multiplied by $X^T$}$$

$$ \frac{\partial L}{\partial b} = \mathbf{1}^T \cdot G \quad \text{, left-multiplied by $\mathbf{1}^T$} $$
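In layer form (a sketch following the same conventions; dout denotes the upstream gradient G):

import numpy as np

class Affine:
    def __init__(self, W, b):
        self.W, self.b = W, b
        self.x = None
        self.dW, self.db = None, None

    def forward(self, x):
        self.x = x
        return x @ self.W + self.b

    def backward(self, dout):
        self.dW = self.x.T @ dout      # left-multiplied: X^T G
        self.db = dout.sum(axis=0)     # 1^T G
        return dout @ self.W.T         # right-multiplied: G W^T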

4. Other Theoretical Details

  1. Batch: mini-batch and epoch for improving training efficiency and stability.
  2. Weight initialisation: Xavier for Sigmoid and tanh; He for ReLU.
  3. Weight updating: SGD, Momentum, Adam, etc.
  4. Overfitting: batches, regularisation, dropout.
  5. Hyperparameters: Bayesian search.

See Chapter 6 of the book.

5. Code

See the attachment for the code realisation in pure NumPy.

Code and Books: http://82.157.170.111:1011/s/9D9BCBbCop6ERaD

Duality

$$\min\ f_0(x), \quad x \in \mathbb{R}^n$$

$$s.t.\ f_i(x)\leq 0, \quad \text{for } i = 1, \dots, m$$

$$\quad\ \ h_i(x)=0, \quad \text{for } i = 1, \dots, q$$

  • That is, in Lagrangian form,

$$L(x,\lambda,\gamma)=f_0(x)+\sum \lambda_i f_i (x) +\sum \gamma_i h_i(x)$$

$$ \min_{x} \max_{\lambda,\gamma} L(x,\lambda, \gamma) $$

$$s.t. \lambda \geq 0$$

  • The Duality Problem is,

$$g(\lambda, \gamma) = \min_{x} L(x,\lambda, \gamma)$$

$$\max_{\lambda, \gamma} g(\lambda, \gamma) $$

$$s.t. \nabla_x L(x,\lambda, \gamma)=0$$

$$\quad \lambda \geq 0$$
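A one-variable example (illustrative, not from the original notes): minimise f_0(x) = x^2 subject to f_1(x) = 1 - x \leq 0.

$$ L(x,\lambda) = x^2 + \lambda (1-x) $$

$$ \nabla_x L = 2x - \lambda = 0 \;\Rightarrow\; x = \frac{\lambda}{2} $$

$$ g(\lambda) = \min_{x} L(x,\lambda) = \lambda - \frac{\lambda^2}{4} $$

$$ \max_{\lambda \geq 0} g(\lambda): \quad \lambda^* = 2, \quad g(\lambda^*) = 1 = f_0(x^*) \ \text{at} \ x^* = 1 $$

Here the dual maximum equals the primal minimum (strong duality).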

Why Duality?

We transform the original problem into its dual, which becomes a convex optimisation problem.

Convex Optimisation

  • The objective function is convex (or the negative of a concave one).
  • The feasible set is a convex set.

See further study.

KKT for Optimisation

Optimisation

Suppose f, g_1, g_2, …, g_m are differentiable.

$$ \max_{x_1, x_2,…, x_n} f(x_1, x_2, …, x_n) $$

$$s.t. g_1(x_1, x_2, …, x_n)\geq 0 $$

$$\quad g_2(x_1, x_2, …, x_n)\geq 0 $$

$$\quad \vdots$$

$$\quad g_m(x_1, x_2, …, x_n)\geq 0 $$

There are m constraints g_i, i = 1, \dots, m, one objective function f to maximise, and n variables.

KKT Conditions

Suppose the constraint qualification is satisfied. A local maximiser of the above problem \bar{x} = (\bar{x}_1, \bar{x}_2, \dots, \bar{x}_n), together with some \bar{\lambda}_1, \bar{\lambda}_2, \dots, \bar{\lambda}_m \geq 0, satisfies the following:

$$ \frac{\partial f}{\partial x_i}(\bar{x}) + \bar{\lambda}_1 \frac{\partial g_1}{\partial x_i}(\bar{x}) + \bar{\lambda}_2 \frac{\partial g_2}{\partial x_i}(\bar{x}) + \dots + \bar{\lambda}_m \frac{\partial g_m}{\partial x_i}(\bar{x}) = 0$$

for i=1,2,…,n, and

$$\bar{\lambda}_k \, g_k(\bar{x})=0$$

for k=1,2,…,m (complementary slackness).

  • We stated "Suppose the constraint qualification is satisfied" at the beginning. What is this qualification?

See step 4 in the next section.

Solve the Maximisation Problem

  • Step 1. Set up the Lagrangian.
  • Step 2. Write down the KKT conditions and constraints.
  • Step 3. Solve the equation system. (P.S. Discuss complementary slackness, and confirm the possible solutions.)
  • Step 4. Check the constraint qualification.

Constraint set: C=\{ (x_1, x_2, \dots, x_n)\in \mathbb{R}^n \mid g_k(x_1, x_2, \dots, x_n) \geq 0 \} for k=1,2,…,m.

The constraint qualification is satisfied at \bar{x} in C if the constraints that hold at \bar{x} with equality (the binding constraints) have linearly independent gradients. That is, if the vectors

$$ \nabla g_k(\bar{x})=\bigg( \frac{\partial g_k}{\partial x_1}(\bar{x}), \frac{\partial g_k}{\partial x_2}(\bar{x}),…,\frac{\partial g_k}{\partial x_n}(\bar{x}) \bigg)’ $$

for the binding constraints k are linearly independent.
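A small worked example (illustrative): maximise f(x_1, x_2) = x_1 x_2 subject to g_1(x_1, x_2) = 1 - x_1 - x_2 \geq 0. The KKT conditions are

$$ x_2 - \lambda = 0, \quad x_1 - \lambda = 0, \quad \lambda (1 - x_1 - x_2) = 0 $$

If \lambda > 0, the constraint binds: x_1 = x_2 = \lambda and 1 - 2\lambda = 0, so \lambda = 1/2 and \bar{x} = (1/2, 1/2) with f(\bar{x}) = 1/4. The constraint qualification holds, since \nabla g_1(\bar{x}) = (-1, -1)' \neq 0.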