These notes are not only a review for quant interview preparation but also, hopefully, a learning note for my babe.
Key Points
1. Preliminaries
First and most importantly, I need to explain what statistics is and why we should learn it. The following is based only on my own understanding. My understanding is pretty limited (I have only a master's degree, so I am definitely not an expert in statistics) and subjective, so please share your suggestions and even your criticisms. I would be glad to hear your ideas.
1.1 What is Statistics?
From my understanding, statistics is a tool, built on mathematics, for explaining the world. What nonsense am I talking.
To be serious, I would say that statistics is the process of estimating a population from samples.
Studying the whole population is always costly and largely impractical. For example, to run tests at the individual level, we would have to collect data from every person. A population census can only be done at the national level and conducted by the government. Even so, a census cannot be performed on a year-by-year basis, and there are always measurement errors. Thus, a more cost-effective way is to estimate the population from the data of a small, randomly selected set of people.
Another example is the weather forecast, which is similar to a time series or panel data analysis. The forecast will most likely be biased because things change unpredictably and irregularly, so we may say it is impossible to collect the full data needed to estimate the population (the factors related to weather in this case). Thus, a simpler way is to collect various factors and historical data, such as temperature, because we may assume that temperature changes are consistent over a short period of time.
However, there are gaps between the population and the sample. How can we bridge those gaps? The answer is statistics. Statistics provides mathematically proven methods, based on assumptions, that let the sample capture the population better.
Let’s begin our study.
2. Probability
2.1 Conditional Probability
$$
P(A|B)=\frac{P(A\cap B)}{P(B)}\\ \\ P(A\cap B)=P(A|B)\times P(B)
$$
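A minimal simulation sketch of this identity, using two dice; the events A = "the sum is 7" and B = "the first die is even" are my own illustrative choices, not from the notes:

```python
# Estimate P(A|B) by counting within B and compare with the exact P(A∩B)/P(B).
import random

random.seed(0)
n = 200_000
count_B = count_A_and_B = 0
for _ in range(n):
    d1, d2 = random.randint(1, 6), random.randint(1, 6)
    A = (d1 + d2 == 7)          # the sum of the two dice is 7
    B = (d1 % 2 == 0)           # the first die shows an even number
    count_B += B
    count_A_and_B += (A and B)

print("Simulated P(A|B):", count_A_and_B / count_B)
print("Exact P(A and B)/P(B):", (3 / 36) / (1 / 2))   # = 1/6
```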
2.2 Mutually Exclusive and Independent Events
If two events are mutually exclusive, then
$$
P(A\cap B)=0 \\ \\ P(A\cup B)=P(A)+P(B)
$$
If two events are independent, then
$$
P(A|B)=P(A) \\ \\ so, \quad P(A\cap B)=P(A)\times P(B)
$$
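A quick check of both ideas by exact enumeration over two dice; the events A, B, C below are illustrative choices of mine:

```python
# Verify independence (product rule) and mutual exclusivity with exact fractions.
from itertools import product
from fractions import Fraction

outcomes = list(product(range(1, 7), repeat=2))   # 36 equally likely ordered pairs

def prob(event):
    """Exact probability of an event, given as a predicate on an outcome."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def A(o): return o[0] % 2 == 0        # first die is even
def B(o): return o[0] + o[1] == 7     # the sum is 7
def C(o): return o[0] + o[1] == 2     # the sum is 2

# Independence: P(A and B) equals P(A) * P(B)
print(prob(lambda o: A(o) and B(o)), "==", prob(A) * prob(B))   # 1/12 == 1/12
# Mutual exclusivity: B and C can never happen together
print(prob(lambda o: B(o) and C(o)))                            # 0
print(prob(lambda o: B(o) or C(o)), "==", prob(B) + prob(C))    # 7/36 == 7/36
```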
3. Random Variables – r.v.
3.1 Definition
Random variables are denoted by capital letters: X, Y, Z.
Their observed values (realisations) are denoted by lowercase letters: x, y, z.
3.2 Probability Mass/Density Function – p.d.f. (For Discrete r.v. or Continuous r.v.)
3.2.1 Definition
The p.d.f. captures how likely it is that a r.v. X takes a given value x. For a discrete r.v.,
$$
P(X=x)=P(x)
$$
3.2.2 Properties of p.d.f.
- \(f(x)\geq 0\), since probabilities are never negative.
- \(\int_{-\infty}^{+\infty} f(x)\ dx=1\)
- \(P(a<X<b)=\int_a^b f(x) \ dx\)
Replace the integral with summation for discrete r.v.
P.S. For a continuous r.v. X, P(X=x)=0. That means any single point under the p.d.f. of a continuous r.v. has zero probability.
For example, the probability of drawing exactly the number 3 from a continuous uniform distribution over [1, 10] is zero.
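A small numerical sketch of these properties, assuming a N(0, 1) density as the example (an arbitrary choice):

```python
# Check non-negativity, total integral 1, and P(a < X < b) by a simple Riemann sum.
import numpy as np

def f(x, mu=0.0, sigma=1.0):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

x = np.linspace(-10, 10, 200_001)
dx = x[1] - x[0]

print("f(x) >= 0 everywhere:", bool(np.all(f(x) >= 0)))
print("Integral over the whole line:", float(np.sum(f(x)) * dx))    # ~ 1
a, b = -1.0, 1.0
mask = (x > a) & (x < b)
print("P(-1 < X < 1):", float(np.sum(f(x[mask])) * dx))             # ~ 0.6827
```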
3.3 Cumulative Distribution Function – c.d.f
$$
F(x)=P(X\leq x) \\ \\ f(x)=\frac{d}{dx} F(x)
$$
3.4 Expectation
$$
E(X)=\mu \\ \\ E(X)=\int_{x} x\cdot f(x)\ dx \ \ \text{(continuous)} \quad \text{or} \quad E(X)=\sum_x x\cdot P(X=x) \ \ \text{(discrete)}
$$
3.5 Variance and Standard Deviation
$$
Var(X)=\sigma^2\\ \\ Var(X)=E[(X-E(X))^2]=E(X^2)-(E(X))^2 \\ =\frac{\sum (x-\mu)^2}{n}=\frac{\sum x^2}{n}-\mu^2
$$
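A tiny discrete example, using a fair six-sided die (my own choice), to check the expectation formula and both forms of the variance formula:

```python
# E(X) = sum of x*P(x); Var(X) = E[(X-E(X))^2] = E(X^2) - (E(X))^2.
import numpy as np

values = np.arange(1, 7)
probs = np.full(6, 1 / 6)

EX = np.sum(values * probs)
EX2 = np.sum(values ** 2 * probs)
var_shortcut = EX2 - EX ** 2
var_direct = np.sum((values - EX) ** 2 * probs)

print("E(X) =", EX)                               # 3.5
print("Var via E(X^2) - (E(X))^2 =", var_shortcut)  # ~ 2.9167
print("Var via E[(X - E(X))^2]   =", var_direct)    # same value
```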
3.6 Moments
The first moment: \(E(X)=\mu\).
The \(n^{th}\) moment: \(E(X^n)=\int_x x^n\ f(x)\ dx\).
The second central moment is about the mean: \(E[(X-E(X))^2]=\sigma^2\), the variance.
The third central moment: \(E[(X-E(X))^3]\). Skewness \(=\frac{E[(X-E(X))^3]}{\sigma^3}\). A normal dist has a skewness of 0. (Right or left tails.)
The fourth central moment: \(E[(X-E(X))^4]\). Kurtosis \(=\frac{E[(X-E(X))^4]}{\sigma^4}\). A normal dist has a kurtosis of 3. (Fat or thin, tall or short.)
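A simulation sketch of these facts; the sampled distribution N(0, 2^2) is an arbitrary choice:

```python
# Standardised third and fourth central moments of normal samples.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=2.0, size=1_000_000)

mu = x.mean()
sigma = x.std()
skewness = np.mean((x - mu) ** 3) / sigma ** 3
kurtosis = np.mean((x - mu) ** 4) / sigma ** 4

print("skewness:", skewness)   # ~ 0 for a normal distribution
print("kurtosis:", kurtosis)   # ~ 3 for a normal distribution
```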
3.7 Covariance
$$
Cov(X,Y)=E[(X-E(X))(Y-E(Y))]\\ =E(XY)-E(X)E(Y)
$$
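A quick simulation check of the shortcut formula; the construction Y = 2X + noise is only illustrative:

```python
# Cov by definition vs. Cov by the shortcut E(XY) - E(X)E(Y).
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500_000)
y = 2 * x + rng.normal(size=500_000)

cov_definition = np.mean((x - x.mean()) * (y - y.mean()))
cov_shortcut = np.mean(x * y) - x.mean() * y.mean()

print(cov_definition, "vs", cov_shortcut)   # both ~ 2, since Cov(X, 2X+e) = 2 Var(X)
```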
4. Distribution
The meaning of distributions, and the properties (mean & var).
4.1 Bernoulli Dist
$$
X\sim Bernoulli(p)\\ \\ P(X=1)=p,\quad P(X=0)=1-p\\ E(X)=p,\quad Var(X)=p(1-p)
$$
4.2 Binomial Dist
$$
X\sim Bin(n,p)\\ \\ P(X=x)=\binom{n}{x}p^x(1-p)^{n-x}\\ E(X)=np,\quad Var(X)=np(1-p)
$$
4.3 Poisson Dist
$$
X \sim Poisson(\lambda)\\ \\p.d.f \quad P(X=x)=\frac{e^{-\lambda}\lambda^x}{x!}\\ E(X)=\lambda,\quad Var(X)=\lambda
$$
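A simulation sketch (lambda = 4 is an arbitrary choice) confirming that the mean and variance are both close to lambda:

```python
# For X ~ Poisson(lambda), the sample mean and sample variance should both be near lambda.
import numpy as np

rng = np.random.default_rng(2)
lam = 4.0
x = rng.poisson(lam=lam, size=1_000_000)

print("sample mean    :", x.mean())   # ~ 4
print("sample variance:", x.var())    # ~ 4
```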
4.4 Normal Dist & Standard Normal
$$
X\sim N(\mu,\sigma^2)\\ \\ p.d.f. \quad f(x)=\frac{1}{\sqrt{2\pi \sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
$$
For a standard normal dist,
$$
X\sim N(0,1)\\ \\ E(X)=0,\quad Var(X)=1
$$
4.4.1 Standardisation
$$
Z=\frac{X-\mu}{\sigma}
$$
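A simulation sketch of standardisation; mu = 10 and sigma = 3 are arbitrary illustrative values:

```python
# Z = (X - mu)/sigma should have mean close to 0 and variance close to 1.
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 10.0, 3.0
x = rng.normal(loc=mu, scale=sigma, size=1_000_000)
z = (x - mu) / sigma

print("mean of Z:", z.mean())   # ~ 0
print("var of Z :", z.var())    # ~ 1
```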
4.4.2 Properties of Normal Dist
One / two / three standard deviation regions: roughly 68%, 95%, and 99.7% of the probability lies within one, two, and three standard deviations of the mean, respectively.
5. Central Limit Theorem – CLT
i.i.d. – independent and identically distributed
Suppose \(X_1,X_2,…,X_n\) are n independent r.v.s, each with the same distribution. As the number n increases, the distributions of
$$
X_1+X_2+…+X_n\\\\ \text{and,}\\\\ \frac{X_1+X_2+…+X_n}{n}
$$
would each behave approximately like a normal distribution.
Key facts:
- The distribution of X is not specified. We do not need to restrict the distribution of the r.v.s, as long as they all follow the same distribution.
- If X is a r.v. with mean \(\mu\) and standard deviation \(\sigma\) from any distribution, the CLT tells us that the distribution of the sample mean \(\bar{X}\) is approximately normal. Its mean and variance are:
$$
E(\bar{X})=E\left(\frac{\sum X_i}{n}\right)=\frac{\sum E(X_i)}{n}=\frac{n\mu}{n}=\mu
$$
$$
Var(\bar{X})=Var\left(\frac{\sum X_i}{n}\right)=\frac{\sum Var(X_i)}{n^2}=\frac{n\sigma^2}{n^2}=\frac{\sigma^2}{n}
$$
Therefore, we get the distribution of \(\bar{X}\):
$$
\bar{X}\sim N(\mu,\frac{\sigma^2}{n})
$$
By standardising it,
$$
\frac{\bar{X}-\mu}{\frac{\sigma}{\sqrt{n}}}\sim N(0,1)
$$
Also, for the sum \(S_n=X_1+X_2+…+X_n\),
$$
S_n \sim N(n\mu,n\sigma^2)\\\\ \frac{S_n-n\mu}{\sqrt{n}\sigma}\sim N(0,1)
$$
The more observations there are, the closer the distribution is to normal. Also, a smaller standard deviation means the estimate varies less and is more accurate.
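A simulation sketch of the CLT, assuming an Exponential(1) population (a deliberately non-normal choice of mine):

```python
# Sample means of a skewed distribution still behave approximately normally.
import numpy as np

rng = np.random.default_rng(4)
n = 50              # observations per sample
reps = 200_000      # number of repeated samples
samples = rng.exponential(scale=1.0, size=(reps, n))   # mean 1, variance 1
xbar = samples.mean(axis=1)

# Theory: the sample mean is approximately N(mu, sigma^2/n) = N(1, 1/50)
print("mean of the sample means    :", xbar.mean())   # ~ 1
print("variance of the sample means:", xbar.var())    # ~ 0.02

# The standardised mean should look like N(0, 1)
z = (xbar - 1.0) / np.sqrt(1.0 / n)
print("P(Z < 1.96):", np.mean(z < 1.96))               # ~ 0.975 under N(0, 1)
```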
Why is CLT important?
It is important because it provides a way to use repeated observations to estimate the whole population, which is impossible to observe in full.
6. A Few Notations
Recall that our aim in using statistics is to learn about the true population. We may assume the true population follows a distribution, and that distribution has some parameters. What we are doing now is using the sample data (which is feasible to collect) to infer the population parameters.
- Estimator: a function, using sample or available data, to estimate the population, e.g. \(\bar{x}\) and \(S^2\).
- Estimate: the value we actually calculate. By plugging data into the estimator, the output is the estimate.
Population (Population Parameters that we want to get but can never get)
Population Mean: \(\mu=\frac{\sum x_i}{N}\).
Population Variance: \(\sigma^2=\frac{\sum (x_i-\mu)^2}{N}\).
Sample Estimator
Sample Mean: \(\bar{x}=\frac{\sum x_i}{n}\).
Sample Variance: \(\hat{\sigma}^2=\frac{\sum (x_i-\bar{x})^2}{n}\), where n is the sample size.
Plugging data into the sample estimators gives the estimates, and those estimates are then used to infer the population parameters.
Remember that a sample is only part of the population; we collect data from a sample because it is more accessible and feasible to obtain. Still, we need the sample data to be representative of the population, or in other words, to give us some foresight about the whole population. Therefore, we use different notation for sample statistics.
An important aspect is that we want our estimators to represent the population well. There are several criteria for this.
6.1 Unbiasedness
If \(E(\bar{X})=\mu\) or \(E(S^2)=\sigma^2\) (the expectation of the sample estimator equals the population parameter), then we say the estimator is unbiased.
The unbiased estimator of sample variance is \(S^2=\frac{\sum (x_i-\bar{x})^2}{n-1}\).
$$
E(S^2)=\sigma^2
$$
Why is the denominator "n-1"?
A full discussion would take a while. A simple way to understand the "-1" is as an adjustment for using \(\bar{x}\) in the numerator: \(\bar{x}\) is itself calculated from the same data to stand in for the population mean \(\mu\), so \(\sum (x_i-\bar{x})^2\) is systematically smaller than \(\sum (x_i-\mu)^2\); dividing by \(n-1\) (one degree of freedom is used up by \(\bar{x}\)) corrects this downward bias.
In sum, \(S^2\) is an unbiased estimator of the population variance \(\sigma^2\). We also have a special name for the standard deviation of an estimator: the standard error, s.e. (for example, the standard error of the sample mean is \(S/\sqrt{n}\)).
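A simulation sketch of why the n-1 denominator matters; the true sigma^2 = 4 and n = 10 are arbitrary illustrative values:

```python
# Averaging many sample variances: dividing by n-1 is right on average,
# while dividing by n understates the true variance.
import numpy as np

rng = np.random.default_rng(5)
sigma2 = 4.0
n = 10
reps = 200_000
samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(reps, n))

s2_unbiased = samples.var(axis=1, ddof=1)   # divide by n - 1
s2_biased = samples.var(axis=1, ddof=0)     # divide by n

print("average of S^2 with n-1:", s2_unbiased.mean())   # ~ 4
print("average of S^2 with n  :", s2_biased.mean())     # ~ 3.6 = sigma^2 * (n-1)/n
```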
6.2 Consistency
If, as \(n\rightarrow \infty\), the estimator converges to the population parameter, we say that estimator is consistent.
For example, although \(\hat{\sigma}^2=\frac{\sum (x_i-\bar{x})^2}{n}\) is biased, it is consistent as the number of observations keeps increasing.
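A simulation sketch of consistency, showing the biased (divide-by-n) estimator approaching the true sigma^2 = 4 (an arbitrary choice) as n grows:

```python
# The biased variance estimator still converges to the truth as n increases.
import numpy as np

rng = np.random.default_rng(6)
sigma2 = 4.0
for n in (10, 100, 10_000, 1_000_000):
    x = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=n)
    print(f"n = {n:>9}: biased estimate = {x.var(ddof=0):.4f}")
```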
The discussion in this section has flaws and is awaiting updates.
7. Estimation
7.1 Maximum Likelihood Estimation – MLE
MLE assumes a probability distribution for the r.v. X, fits it to the sample observations, and looks for the parameter values that maximise the joint probability of the data (the likelihood function).
To illustrate, we need to find the parameter \(\lambda\) that maximises the likelihood function.
$$
\lambda_0=\text{arg}\max_{\lambda}\ L(\lambda;x)
$$
The value of the parameter, \(\lambda_0\), is our MLE estimator. (Remember what an estimator is? See Section 6.)
For example
Assume r.v. \(X\sim N(\mu,\sigma^2)\). Let \(x_1,x_2,…,x_n\) be a random sample of i.i.d. observations. We use MLE to find the values of \(\mu\) and \(\sigma^2\). So, we need to maximise the log-likelihood function (instead of using the likelihood function directly, we apply a logarithm for easier calculation; because the log transformation is monotonic, it does not change the maximiser).
$$
\begin{align*} f(x_1,x_2,…,x_n;\mu,\sigma^2)&=f(x_1;\mu,\sigma^2)f(x_2;\mu,\sigma^2)…f(x_n;\mu,\sigma^2)\\ \text{Let}\quad L(\mu,\sigma^2;x_1,x_2,…,x_n)&=log \ l(\mu,\sigma^2;x_1,x_2,…,x_n)\\ &=log\ f(x_1;\mu,\sigma^2)+log\ f(x_2;\mu,\sigma^2)+…+log\ f(x_n;\mu,\sigma^2)\\ &=\sum_{i=1}^n log\ f(x_i;\mu,\sigma^2) \\ \text{Plug in }f(x;\mu,\sigma^2)&=\frac{1}{\sqrt{2\pi \sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \\ L(\mu,\sigma^2;x_1,…,x_n)&=\sum_{i=1}^n log\left[ \frac{1}{\sqrt{2\pi \sigma^2}}\exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right) \right] \\ &=-\frac{n}{2}log\ (2\pi)-n\cdot log\ (\sigma)-\frac{1}{2\sigma^2}\sum (x_i-\mu)^2 \end{align*}
$$
F.O.C.
$$
\hat{\mu}_{MLE}=\frac{1}{n}\sum x_i \\ \hat{\sigma}^2_{MLE}=\frac{1}{n}\sum (x_i-\hat{\mu}_{MLE})^2
$$
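A sketch that maximises the normal log-likelihood numerically and compares the result with the closed forms above; scipy.optimize.minimize is used here only as a generic optimiser, and mu = 2, sigma = 1.5 are arbitrary simulation values:

```python
# Numerical MLE for a normal sample vs. the closed-form estimators.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
x = rng.normal(loc=2.0, scale=1.5, size=5_000)

def neg_log_likelihood(params):
    mu, log_sigma = params                     # optimise log(sigma) to keep sigma > 0
    sigma = np.exp(log_sigma)
    return -np.sum(-0.5 * np.log(2 * np.pi) - np.log(sigma)
                   - (x - mu) ** 2 / (2 * sigma ** 2))

res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma2_hat = res.x[0], np.exp(res.x[1]) ** 2

print("numeric MLE :", mu_hat, sigma2_hat)
print("closed form :", x.mean(), x.var(ddof=0))   # (1/n) sum x_i and (1/n) sum (x_i - mu_hat)^2
```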
We will find that, under the normality assumption, MLE leads to the same estimator as OLS in the following section.
7.2 Regression
Assume a linear model; we choose the coefficients that minimise the sum of squared residuals.
$$
\hat{\beta}_{all}=arg\min_{\beta_{all}}\sum(y_i-\hat{y_i})^2\\ \Leftrightarrow\\ \hat{\beta}=arg\min_{\beta}(Y-\hat{Y})'(Y-\hat{Y})\\ \\
$$
, where
$$
\hat{y}=\hat{\beta_0}+\hat{\beta_1}x_1+…+\hat{\beta_k}x_k\\ or,\quad \hat{Y}=X\hat{\beta}
$$
F.O.C.
$$\hat{\beta}=(X'X)^{-1}X'Y$$
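A sketch computing \(\hat{\beta}=(X'X)^{-1}X'Y\) directly with numpy and cross-checking it against numpy's least-squares solver; the data-generating coefficients (1, 2, -3) are my own illustrative choices:

```python
# Closed-form OLS vs. numpy's least-squares solver on simulated data.
import numpy as np

rng = np.random.default_rng(8)
n = 1_000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x1, x2])        # include an intercept column
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y      # (X'X)^{-1} X'Y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("closed form:", beta_hat)      # ~ [1, 2, -3]
print("lstsq      :", beta_lstsq)    # same values
```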