Statistics - Fanyu Zhao

These notes are not only a review in preparation for quant interviews, but hopefully also learning notes for a junior PhD student, and also for my babe.

Key Points

1. Preliminaries

Firstly and most importantly, I need to state what statistics is and why we should learn it. The following is based only on my own understanding. My understanding is very limited (I hold only a master's degree, so I am definitely not an expert in statistics) and subjective, so please share your suggestions and even your criticisms. Glad to know your ideas.

1.1 What is Statistics?

In my understanding, statistics is a tool, expressed in mathematics, for explaining the world. What nonsense I am talking.

To be serious: I would say that statistics is a process of estimating a population from samples.

Studying the population directly is always costly, and often impractical. For example, to run a test at the individual level, we would have to collect data from all the people. A population census can only be done at the national level and conducted by the government. Even then, the census cannot be performed on a year-by-year basis, and there are always measurement errors. Thus, a more cost-effective way is to estimate the population using data from a small set of randomly selected people.

Another example is the weather forecast, which is similar to a time series or panel data analysis. The forecast will most likely be biased, because things change unpredictably and irregularly. We may even say it is impossible to use the full data to estimate the population (the factors related to the weather, in this case). Thus, a simpler way might be to collect different factors and historical data, such as temperature, because we may assume that temperature changes are consistent over a short period of time.

However, there are gaps between the population and the sample. How can we bridge those gaps? The answer is statistics. Statistics provides mathematically proven methods that, under certain assumptions, let the sample better capture the population.

Let's begin our study.

2. Probability

2.1 Conditional Probability
$$P(A|B)=\frac{P(AB)}{P(B)}, \qquad P(AB)=P(A|B)\times P(B)$$
2.2 Mutually Exclusive and Independent Events
If two events are mutually exclusive, then

$$P(AB)=0, \qquad P(A\cup B)=P(A)+P(B)$$

If two events are independent, then

$$P(A|B)=P(A), \quad \text{so} \quad P(AB)=P(A)\times P(B)$$
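As a quick numerical illustration (a toy example of my own, with A = "the first die is even" and B = "the second die is greater than 4"), the sketch below enumerates two fair dice and checks that $P(AB)=P(A)\times P(B)$:

```python
from itertools import product

# Enumerate all 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

A = [o for o in outcomes if o[0] % 2 == 0]                  # first die is even
B = [o for o in outcomes if o[1] > 4]                       # second die is greater than 4
AB = [o for o in outcomes if o[0] % 2 == 0 and o[1] > 4]    # both at once

p_A = len(A) / len(outcomes)      # 18/36 = 1/2
p_B = len(B) / len(outcomes)      # 12/36 = 1/3
p_AB = len(AB) / len(outcomes)    # 6/36  = 1/6

print(p_AB, p_A * p_B)            # both 0.1666..., so A and B are independent
```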

3. Random Variables - r.v.

3.1 Definition

Random Variables: X,Y,Z

Observations: x,y,z

3.2 Probability Mass/Density Function - p.d.f. (for Discrete or Continuous r.v.)
3.2.1 Definition

The p.d.f. captures the probability that a r.v. $X$ takes a given value $x$.

$$P(X=x)=P(x)$$
3.2.2 Properties of p.d.f.
  1. $f(x)\ge 0$, since probability is never negative.
  2. $\int_{-\infty}^{+\infty} f(x)\, dx=1$
  3. $P(a<X<b)=\int_a^b f(x)\, dx$

Replace the integral with a summation for a discrete r.v.

P.S. For a continuous r.v. $X$, $P(X=x)=0$. That means that for a continuous r.v., any single point on the p.d.f. has zero probability.

For example, the probability of drawing exactly the number 3 from the continuous interval from 1 to 10 is zero.
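To make these properties concrete, here is a small sketch (using the standard normal p.d.f. from scipy as an arbitrary example of mine) that numerically checks properties 2 and 3, and the zero-probability point:

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

# Property 2: the p.d.f. integrates to 1 over the whole real line.
total, _ = integrate.quad(norm.pdf, -np.inf, np.inf)
print(total)                              # ~1.0

# Property 3: P(a < X < b) is the integral of the p.d.f. from a to b.
a, b = -1.0, 2.0
area, _ = integrate.quad(norm.pdf, a, b)
print(area, norm.cdf(b) - norm.cdf(a))    # the two values agree

# A single point carries zero probability for a continuous r.v.
point, _ = integrate.quad(norm.pdf, 1.0, 1.0)
print(point)                              # 0.0
```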

3.3 Cumulative Distribution Function - c.d.f
$$F(x)=P(X\le x), \qquad f(x)=\frac{d}{dx}F(x)$$
3.4 Expectation
$$E(X)=\mu, \qquad E(X)=\int_{x\in \mathrm{dom}(X)} x\, f(x)\, dx = \sum_x x\, P(X=x)$$
3.5 Variance and Standard Deviation
$$Var(X)=\sigma^2, \qquad Var(X)=E\big[(X-E(X))^2\big]=E(X^2)-\big(E(X)\big)^2=\frac{\sum (x-\mu)^2}{n}=\frac{\sum x^2}{n}-\mu^2$$
3.6 Moments

The first moment: $E(X)=\mu$.

The $n$th moment: $E(X^n)=\int_x x^n\, f(x)\, dx$.

The second central moment (about the mean): $E\big[(X-E(X))^2\big]=\sigma^2$, the variance.

The third central moment: $E\big[(X-E(X))^3\big]$. Skewness $=\frac{E\big[(X-E(X))^3\big]}{\sigma^3}$. The standard normal dist has a skewness of 0. (Right or left tails.)

The fourth central moment: $E\big[(X-E(X))^4\big]$. Kurtosis $=\frac{E\big[(X-E(X))^4\big]}{\sigma^4}$. The standard normal dist has a kurtosis of 3. (Fat or thin tails, tall or short.)
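A small sketch of these quantities on simulated data (the sample size and the choice of a normal population are arbitrary assumptions of mine):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=100_000)   # draws from N(0, 1)

print(x.mean())                         # first moment, close to 0
print(x.var())                          # second central moment, close to 1
print(stats.skew(x))                    # skewness, close to 0 for a normal dist
print(stats.kurtosis(x, fisher=False))  # kurtosis, close to 3 for a normal dist
```

Note that scipy's `kurtosis` reports excess kurtosis (normal = 0) by default; `fisher=False` gives the raw kurtosis (normal = 3) used above.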

3.7 Covariance
$$Cov(X,Y)=E\big[(X-E(X))(Y-E(Y))\big]=E(XY)-E(X)E(Y)$$
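A quick numerical check of the identity $Cov(X,Y)=E(XY)-E(X)E(Y)$ on simulated data (the data-generating process below is just an example of mine):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)                # y is correlated with x by construction

cov_identity = (x * y).mean() - x.mean() * y.mean()   # E(XY) - E(X)E(Y)
cov_numpy = np.cov(x, y, bias=True)[0, 1]             # divide-by-n covariance

print(cov_identity, cov_numpy)                        # both close to 0.5
```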

4. Distribution

The meaning of distributions, and the properties (mean & var).

4.1 Bernoulli Dist
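$$X\sim \text{Bernoulli}(p)$$

p.d.f.

$$P(X=x)=p^x(1-p)^{1-x}, \quad x\in\{0,1\}$$

$$E(X)=p, \qquad Var(X)=p(1-p)$$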
4.2 Binomial Dist
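$$X\sim B(n,p)$$

p.d.f.

$$P(X=x)=\binom{n}{x}p^x(1-p)^{n-x}, \quad x=0,1,...,n$$

$$E(X)=np, \qquad Var(X)=np(1-p)$$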
4.3 Poisson Dist

$$X\sim \text{Poisson}(\lambda)$$

p.d.f.

$$P(X=x)=\frac{e^{-\lambda}\lambda^x}{x!}$$

$$E(X)=\lambda, \qquad Var(X)=\lambda$$
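A quick sanity check of the mean and variance with scipy ($\lambda=3$ is an arbitrary choice of mine):

```python
from scipy.stats import poisson

lam = 3.0
X = poisson(lam)

print(X.mean(), X.var())   # both equal to lambda = 3.0
print(X.pmf(2))            # P(X = 2) = e^{-3} * 3^2 / 2!
```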
4.4 Normal Dist & Standard Normal
$$X\sim N(\mu,\sigma^2)$$

p.d.f.

$$f(x)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

For a standard normal dist,

$$X\sim N(0,1), \qquad E(X)=0, \quad Var(X)=1$$
4.4.1 Standardisation
$$Z=\frac{X-\mu}{\sigma}$$

 

4.4.2 Properties of Normal Dist

One / two / three standard deviation regions (covering roughly 68% / 95% / 99.7% of the probability).
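A short sketch using scipy's normal c.d.f. to compute the probability mass within one, two, and three standard deviations of the mean (my own illustration of the regions above):

```python
from scipy.stats import norm

# P(mu - k*sigma < X < mu + k*sigma) for a normal r.v. equals
# P(-k < Z < k) for the standardised variable Z = (X - mu) / sigma.
for k in (1, 2, 3):
    prob = norm.cdf(k) - norm.cdf(-k)
    print(k, round(prob, 4))    # ~0.6827, ~0.9545, ~0.9973
```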

5. Central Limit Theorem - CLT

i.i.d. - independent and identically distributed

Suppose $X_1, X_2, ..., X_n$ are $n$ independent r.v.s, each with the same distribution. As the number $n$ increases, the distributions of

$$X_1+X_2+...+X_n \qquad \text{and} \qquad \frac{X_1+X_2+...+X_n}{n}$$

would behave more and more like a normal distribution.

Key facts:

  1. The distribution of $X$ is not specified. We do not have to restrict the distribution of the r.v.s, as long as they all follow the same distribution.
  2. If $X$ is a r.v. with mean $\mu$ and standard deviation $\sigma$ from any distribution, the CLT tells us that the distribution of the sample mean $\bar{X}$ is approximately normal.
$$E(\bar{X})=E\left(\frac{\sum X_i}{n}\right)=\frac{\sum E(X_i)}{n}=\frac{n\mu}{n}=\mu, \qquad Var(\bar{X})=Var\left(\frac{\sum X_i}{n}\right)=\frac{\sum Var(X_i)}{n^2}=\frac{n\sigma^2}{n^2}=\frac{\sigma^2}{n}$$

Therefore, we would get the distribution of $\bar{X}$,

$$\bar{X}\sim N\left(\mu, \frac{\sigma^2}{n}\right)$$

By standardising it,

$$\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}\sim N(0,1)$$

Also, for $S_n = X_1+X_2+...+X_n$,

$$S_n \sim N(n\mu,\, n\sigma^2), \qquad \frac{S_n - n\mu}{\sqrt{n}\,\sigma}\sim N(0,1)$$

The more observations there are, the closer the distribution is to normal. Also, a smaller standard deviation means the estimate has less variation and is more accurate.
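A small simulation sketch of the CLT (the exponential population and the sample sizes are arbitrary assumptions of mine): sample means of a clearly non-normal distribution look approximately normal, with standard deviation close to $\sigma/\sqrt{n}$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, n_samples = 50, 10_000

# Exponential(1) is right-skewed, with mu = 1 and sigma = 1.
samples = rng.exponential(scale=1.0, size=(n_samples, n))
sample_means = samples.mean(axis=1)

print(sample_means.mean())        # close to mu = 1
print(sample_means.std())         # close to sigma / sqrt(n) = 1 / sqrt(50) ~ 0.141
print(stats.skew(sample_means))   # close to 0: the sample means look roughly normal
```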

Why is CLT important?

It is important because it provides a way to use repeated observations to estimate the whole population, which is impossible to observe in full.

6. A Few Notations

Recall that our aim in using statistics is to learn the true population. We may assume that the true population follows a distribution, and that distribution has some parameters. What we are doing now is using the sample data (which is feasible to collect) to infer the population parameters.

Population (Population Parameters that we want to get but can never get)

Population mean: $\mu=\frac{\sum_i x_i}{N}$.

Population variance: $\sigma^2=\frac{\sum_i (x_i-\mu)^2}{N}$.

Sample Estimator

Sample mean: $\bar{x}=\frac{\sum_i x_i}{n}$.

Sample variance: $\hat{\sigma}^2=\frac{\sum_i (x_i-\bar{x})^2}{n}$, where $n$ is the sample size.

Plugging data into the sample estimators gives the estimates, and those estimates are then used to infer the population parameters.

Remember that the sample is only part of the population; we collect data from the sample because it is more accessible and feasible to obtain. Still, we need our sample data to be representative of the population, or in other words, to give us some foresight about the whole population. Therefore, we use a different notation for sample statistics.

An important aspect is that we need our sample estimators to represent the population well. There are some criteria for this.

6.1 Unbiasedness

If $E(\bar{X})=\mu$, or $E(S^2)=\sigma^2$ (the expectation of our sample estimator is equal to the population parameter), then we say the estimator is unbiased.

The unbiased estimator of the population variance is $S^2=\frac{\sum_i (x_i-\bar{x})^2}{n-1}$.

$$E(S^2)=\sigma^2$$

Why is the denominator "$n-1$"?

A full discussion of that would be long. We can simply understand the "$-1$" as an adjustment for the $\bar{x}$ in the numerator: $\bar{x}$ is itself computed from the same data to stand in for the unknown population mean $\mu$, which uses up one degree of freedom, so the denominator loses one.

In sum, $S^2$ is an unbiased estimator of the population variance $\sigma^2$. We also have a special name for the standard deviation of an estimator such as the sample mean: the standard error, s.e.
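A simulation sketch of the bias (the normal population with $\sigma^2=4$ and the sample size $n=10$ are arbitrary assumptions of mine): dividing by $n$ underestimates $\sigma^2$ on average, while dividing by $n-1$ does not.

```python
import numpy as np

rng = np.random.default_rng(7)
true_var = 4.0                # population variance of N(0, 2^2)
n, n_trials = 10, 100_000

samples = rng.normal(loc=0.0, scale=2.0, size=(n_trials, n))

biased = samples.var(axis=1, ddof=0)      # divide by n
unbiased = samples.var(axis=1, ddof=1)    # divide by n - 1

print(biased.mean())      # ~3.6, systematically below true_var = 4
print(unbiased.mean())    # ~4.0, unbiased on average
```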

6.2 Consistency

If an estimator gets arbitrarily close to the population parameter as $n\to\infty$, we say that the estimator is consistent.

For example, although $\hat{\sigma}^2=\frac{\sum_i (x_i-\bar{x})^2}{n}$ is biased, it is consistent: the bias vanishes as the number of observations keeps increasing.

The discussion in this section still has flaws, awaiting updates.

7. Estimation

7.1 Maximum Likelihood Estimation - MLE

We assume a probability distribution for the r.v. $X$, fit it to the sample observations, and try to find the parameters that maximise the joint probability (the likelihood function).

To state the problem: we need to find the parameter $\lambda$ that maximises the likelihood function.

$$\lambda_0=\underset{\lambda}{\arg\max}\ L(\lambda; x)$$

The value $\lambda_0$ of the parameter is our MLE estimate. (Remember what an estimator is? See Section 6.)

For example

Assume the r.v. $X\sim N(\mu,\sigma^2)$. Let $x_1,x_2,...,x_n$ be a random sample of i.i.d. observations. We use MLE to find the values of $\mu$ and $\sigma^2$. So, we need to maximise the log-likelihood function (instead of the likelihood function itself, we apply a logarithmic transformation for easier calculation; because the log transformation is monotonic, the transformation is legitimate).

$$
\begin{aligned}
f(x_1,x_2,...,x_n;\mu,\sigma^2) &= f(x_1;\mu,\sigma^2)\, f(x_2;\mu,\sigma^2)\cdots f(x_n;\mu,\sigma^2) \\
\text{Let } L(\mu,\sigma^2;x_1,x_2,...,x_n) &= \log l(\mu,\sigma^2;x_1,x_2,...,x_n) \\
&= \log f(x_1;\mu,\sigma^2)+\log f(x_2;\mu,\sigma^2)+...+\log f(x_n;\mu,\sigma^2) \\
&= \sum_{i=1}^{n}\log f(x_i;\mu,\sigma^2)
\end{aligned}
$$

Plugging in $f(x;\mu,\sigma^2)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$,

$$
L(\mu,\sigma^2;x_1,...,x_n)=\sum_{i=1}^{n}\log\left[\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)\right]=-\frac{n}{2}\log(2\pi)-n\log(\sigma)-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2
$$

F.O.C.

$$\hat{\mu}_{MLE}=\frac{1}{n}\sum_i x_i, \qquad \hat{\sigma}^2_{MLE}=\frac{1}{n}\sum_i (x_i-\hat{\mu}_{MLE})^2$$

We will find that these MLE estimators are the same as the OLS estimators in the following section.
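A sketch that checks the closed-form MLE against scipy's numerical fit (the simulated data set is an arbitrary example of mine; `scipy.stats.norm.fit` returns maximum-likelihood estimates of the location and scale):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
x = rng.normal(loc=2.0, scale=1.5, size=5_000)   # true mu = 2, sigma = 1.5

# Closed-form MLE from the first-order conditions above.
mu_hat = x.mean()
sigma2_hat = ((x - mu_hat) ** 2).mean()          # divide by n, not n - 1

# scipy fits the same model by maximum likelihood.
loc_hat, scale_hat = norm.fit(x)

print(mu_hat, loc_hat)               # both close to 2
print(sigma2_hat, scale_hat ** 2)    # both close to 1.5^2 = 2.25
```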

7.2 Regression

Assume a linear model, and choose the coefficients that minimise the sum of squared residuals.

$$\hat{\beta}_{all}=\underset{\beta_{all}}{\arg\min}\sum_i (y_i-\hat{y}_i)^2, \qquad \hat{\beta}=\underset{\beta}{\arg\min}\,(Y-\hat{Y})'(Y-\hat{Y})$$

where

$$\hat{y}=\hat{\beta}_0+\hat{\beta}_1 x_1+...+\hat{\beta}_k x_k \quad \text{or} \quad \hat{Y}=X\hat{\beta}$$

F.O.C.

$$\hat{\beta}=(X'X)^{-1}X'Y$$
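A minimal sketch of the normal-equations solution on simulated data (the design matrix and the true coefficients are arbitrary assumptions of mine), cross-checked against numpy's least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 2

X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # intercept + k regressors
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# F.O.C. solution: beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with numpy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)      # close to [1.0, 2.0, -0.5]
print(beta_lstsq)    # same values
```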

By Fanyu Zhao