Pre-Probability

Statistics - Fanyu Zhao

These notes are not only a review for quant preparation, but also, hopefully, learning notes for a junior PhD student and also my babe.

Key Points

1. Preliminaries

First and most importantly, I need to state what statistics is and why we should learn it. The following is based only on my own understanding. My understanding is pretty limited (I only have a master's degree, so I am definitely not an expert in Statistics) and subjective, so please send me your suggestions and even your criticism. I would be glad to know your ideas.

1.1 What is Statistics?

In my understanding, Statistics is a tool, characterised by mathematics, to explain the world. What bullshit I am talking.

To be serious, I would say that statistics is the process of estimating the population from samples.

Studying the whole population is always costly and often impractical. For example, to run a test at the individual level, we would have to collect data from everyone. A population census can only be done at the national level and conducted by the government. Even then, the census cannot be performed on a year-by-year basis, and there are always measurement errors. Thus, a more cost-effective way is to estimate the population from the data of a small set of people who are randomly selected.

Another example is the weather forecast, which is similar to doing a time series analysis or panel data analysis. The forecast is most likely biased, because things change unpredictably and irregularly. So we may say that it is even impossible to find the full data to estimate the population (the factors related to weather in this case). Thus, a simpler way might be to collect different factors and historical data, such as temperature, because we may assume temperature changes are consistent over a short period of time.

However, there are gaps between population and sample. How can we bridge those gaps? The answer is Statistics. Statistics provides mathematically proven methods that, based on assumptions, let the sample better capture the population.

Let's begin our study.

2. Probability

2.1 Conditional Probability

\begin{matrix} P (A | B) = \frac{P (A \cap B)}{P (B)} \\ P (A \cap B) = P (A | B) \times P (B) \end{matrix}
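To make the conditional probability formula concrete, here is a minimal sketch in Python. The two-dice setup and the events A and B are my own illustration (assumptions, not part of the notes); it simply counts outcomes and applies $P(A|B)=P(A\cap B)/P(B)$.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of two fair dice (illustrative assumption).
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event, given as a predicate over outcomes."""
    favourable = sum(1 for o in outcomes if event(o))
    return Fraction(favourable, len(outcomes))

A = lambda o: o[0] + o[1] == 8   # event A: the sum is 8
B = lambda o: o[0] == 3          # event B: the first die shows 3

p_b = prob(B)
p_a_and_b = prob(lambda o: A(o) and B(o))
p_a_given_b = p_a_and_b / p_b    # P(A|B) = P(A and B) / P(B)

print(p_a_given_b)  # 1/6
```

Multiplying back, `p_a_given_b * p_b` recovers `p_a_and_b`, which is the second line of the formula.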

2.2 Mutually Exclusive and Independent Events

If two events are mutually exclusive, then

\begin{matrix} P (A \cap B) = 0 \\ P (A \cup B) = P (A) + P (B) \end{matrix}

If two events are independent, then

\begin{matrix} P (A | B) = P (A) \\ \text{so, } P (A \cap B) = P (A) \times P (B) \end{matrix}
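The difference between "mutually exclusive" and "independent" is easy to mix up, so here is a small sketch that checks both by counting. The dice events are hypothetical examples chosen only for illustration.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of two fair dice (illustrative assumption).
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

first_is_even = lambda o: o[0] % 2 == 0
second_is_six = lambda o: o[1] == 6
sum_is_two = lambda o: sum(o) == 2
sum_is_twelve = lambda o: sum(o) == 12

# Independent events: the joint probability equals the product.
p_joint = prob(lambda o: first_is_even(o) and second_is_six(o))
independent = p_joint == prob(first_is_even) * prob(second_is_six)

# Mutually exclusive events: they never happen together,
# so the probability of the union is the sum.
p_both = prob(lambda o: sum_is_two(o) and sum_is_twelve(o))
p_either = prob(lambda o: sum_is_two(o) or sum_is_twelve(o))

print(independent, p_both, p_either)
```

Note that the two mutually exclusive events here are not independent: knowing that one happened tells us the other did not.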

3. Random Variables - r.v.

3.1 Definition

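A random variable is denoted by a capital letter, such as $X, Y, Z$.

The values it takes are denoted by lowercase letters, such as $x,y,z$.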

3.2 Probability Mass/Density Function - p.d.f. (For Discrete r.v. or Continuous r.v.)

3.2.1 Definition

The p.d.f. describes the probability that the r.v. $X$ takes the value $x$.

P (X = x) = P (x)

3.2.2 Properties of p.d.f.

$f(x)\geq 0$ , since probability is never negative.
$\int_{-\infty}^{+\infty} f(x)\ dx=1$
$P(a<X<b)=\int_a^b f(x) \ dx$

Replace the integral with summation for a discrete r.v.

For a continuous r.v. $X$, $P(X=x)=0$ . That means for a continuous r.v., any single point on the p.d.f. has zero probability.

For example, the probability of selecting exactly the number “3” from the real numbers between 1 and 10 is zero.
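The properties above can be checked numerically. This is a rough sketch, assuming a uniform p.d.f. on the interval from 1 to 10 (my choice of example) and a simple midpoint rule for the integral; `uniform_pdf` and `integrate` are hypothetical helper names.

```python
def uniform_pdf(x, low=1.0, high=10.0):
    """p.d.f. of a uniform r.v. on [low, high] (illustrative assumption)."""
    return 1.0 / (high - low) if low <= x <= high else 0.0

def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f from a to b."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

total = integrate(uniform_pdf, 1.0, 10.0)      # should be close to 1
p_interval = integrate(uniform_pdf, 2.0, 5.0)  # P(2 < X < 5), close to 3/9
p_point = integrate(uniform_pdf, 3.0, 3.0)     # P(X = 3): zero-width interval

print(total, p_interval, p_point)
```

The zero-width integral is exactly why a single point carries no probability for a continuous r.v.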

3.3 Cumulative Distribution Function - c.d.f

\begin{matrix} F (x) = P (X \leq x) \\ f (x) = \frac{d}{d x} F (x) \end{matrix}

3.4 Expectation

\begin{matrix} E (X) = μ \\ E (X) = \int_{\text{domain of } X} x \cdot f (x) d x \ \text{(continuous r.v.)} \\ E (X) = \sum_{x} x \cdot P (X = x) \ \text{(discrete r.v.)} \end{matrix}

3.5 Variance and Standard Deviation

\begin{matrix} Var (X) = σ^{2} \\ Var (X) = E [(X - E (X))^{2}] = E (X^{2}) - (E (X))^{2} \\ = \frac{\sum (x - μ)^{2}}{n} = \frac{\sum x^{2}}{n} - μ^{2} \end{matrix}
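As a quick worked example of expectation and variance, here is a sketch for a fair six-sided die (an assumed example). It computes the variance both from the definition and from the shortcut $E(X^2)-(E(X))^2$, using exact fractions so the two can be compared directly.

```python
from fractions import Fraction

# p.m.f. of a fair six-sided die (illustrative assumption).
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

# E(X) = sum of x * P(X = x)
mean = sum(x * p for x, p in pmf.items())

# Var(X) from the definition E[(X - E(X))^2]
var_definition = sum((x - mean) ** 2 * p for x, p in pmf.items())

# Var(X) from the shortcut E(X^2) - (E(X))^2
var_shortcut = sum(x ** 2 * p for x, p in pmf.items()) - mean ** 2

print(mean, var_definition, var_shortcut)  # 7/2 35/12 35/12
```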

3.6 Moments

First moment: $E(X)=\mu$ , the mean.

The $n^{th}$ moment is $E(X^n)=\int_x x^n\ f(x)\ dx$ .

Second central moment: $E[(X-E(X))^2]=\sigma^2$ , the Variance.

Third central moment: $E[(X-E(X))^3]$ , with $Skewness = \frac{E[(X-E(X))^3]}{\sigma^3}$ . The standard normal dist has a Skewness of 0. (Right or Left Tails)

Fourth central moment: $E[(X-E(X))^4]$ , with $Kurtosis = \frac{E[(X-E(X))^4]}{\sigma^4}$ . The standard normal dist has a Kurtosis of 3. (Fat or Thin, Tall or Short).
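To see the skewness and kurtosis values for the standard normal, here is a rough simulation sketch. The sample size and seed are arbitrary assumptions, and the results are sample estimates, so they only land near 0 and 3 rather than exactly on them.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable (arbitrary choice)
xs = [rng.gauss(0.0, 1.0) for _ in range(50_000)]  # draws from N(0, 1)

n = len(xs)
mean = sum(xs) / n

def central_moment(k):
    """Sample version of E[(X - E(X))^k]."""
    return sum((x - mean) ** k for x in xs) / n

variance = central_moment(2)
skewness = central_moment(3) / variance ** 1.5  # divide by sigma^3
kurtosis = central_moment(4) / variance ** 2    # divide by sigma^4

print(round(skewness, 2), round(kurtosis, 2))
```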

3.7 Covariance

\begin{matrix} Cov (X, Y) = E [(X - E (X)) (Y - E (Y))] \\ = E (X Y) - E (X) E (Y) \end{matrix}
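A tiny sketch showing that the two covariance formulas agree. The four paired values are made up for illustration, and each pair is treated as equally likely.

```python
from fractions import Fraction

# Made-up paired data, each pair equally likely (illustrative assumption).
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 9]
n = len(xs)

def expect(values):
    """Expectation of equally likely values."""
    return Fraction(sum(values), n)

mean_x, mean_y = expect(xs), expect(ys)

# Definition: E[(X - E(X))(Y - E(Y))]
cov_definition = expect([(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)])

# Shortcut: E(XY) - E(X)E(Y)
cov_shortcut = expect([x * y for x, y in zip(xs, ys)]) - mean_x * mean_y

print(cov_definition, cov_shortcut)  # 11/4 11/4
```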

4. Distribution

The meaning of distributions, and the properties (mean & var).

4.1 Bernoulli Dist

4.2 Binomial Dist

4.3 Poisson Dist

\begin{matrix} X \sim \text{Poisson} (λ) \\ \text{p.d.f. } P (X = x) = \frac{e^{- λ} λ^{x}}{x!} \\ E (X) = λ, Var (X) = λ \end{matrix}
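The mean and variance of the Poisson dist can be checked directly from its probability function. A minimal sketch, assuming $\lambda=3$ and truncating the infinite sum at 60 terms (the remaining tail is negligible for this $\lambda$); `poisson_pmf` is a hypothetical helper name.

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) = exp(-lam) * lam^x / x!"""
    return math.exp(-lam) * lam ** x / math.factorial(x)

lam = 3.0            # illustrative choice of the rate parameter
support = range(60)  # truncated support; the tail beyond this is negligible

total = sum(poisson_pmf(x, lam) for x in support)
mean = sum(x * poisson_pmf(x, lam) for x in support)
variance = sum((x - mean) ** 2 * poisson_pmf(x, lam) for x in support)

print(total, mean, variance)  # all three are numerically close to 1, 3, 3
```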

4.4 Normal Dist & Standard Normal

\begin{matrix} X \sim N (μ, σ^{2}) \\ \text{p.d.f. } f (x) = \frac{1}{\sqrt{2 π σ^{2}}} \exp \left( - \frac{(x - μ)^{2}}{2 σ^{2}} \right) \end{matrix}

For a standard normal dist,

\begin{matrix} X \sim N (0, 1) \\ E (X) = 0, V a r (X) = 1 \end{matrix}

4.4.1 Standardisation

Z = \frac{X - μ}{σ}

4.4.2 Properties of Normal Dist

One / Two /Three standard deviation regions.
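The one / two / three standard deviation regions can be computed from the error function, since for a standard normal $Z$ we have $P(|Z|<k)=\mathrm{erf}(k/\sqrt{2})$. A small sketch (`prob_within` is a hypothetical helper name):

```python
import math

def prob_within(k):
    """P(|Z| < k) for a standard normal Z, via the error function."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

After standardisation, the same regions apply to any normal dist: about 68%, 95% and 99.7% of the probability lies within one, two and three standard deviations of the mean.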

5. Central Limit Theorem - CLT

i.i.d. - independent and identically distributed

Let $X_1,X_2,...,X_n$ be $n$ i.i.d. random variables. As $n$ increases, the distribution of

\begin{matrix} X_{1} + X_{2} + . . . + X_{n} \\ \text{and,} \\ \frac{X_{1} + X_{2} + . . . + X_{n}}{n} \end{matrix}

would behave like a normal distribution.

Key facts:

The distribution of $X$ is not stated. We do not have to restrict the distribution of the r.v.s, as long as they all follow the same dist.
If $X$ has mean $\mu$ and standard deviation $\sigma$, the sample mean $\bar{X}$ is approximately normally distributed.

\begin{matrix} E (\bar{X}) = E (\frac{\sum X}{n}) = \frac{\sum E (X)}{n} \\ = \frac{n μ}{n} = μ \\ V a r (\bar{X}) = V a r (\frac{\sum X}{n}) = \frac{\sum V a r (X)}{n^{2}} \\ = \frac{n σ^{2}}{n^{2}} = \frac{σ^{2}}{n} \end{matrix}

Therefore, for the sample mean $\bar{X}$ ,

\bar{X} \sim N (μ, \frac{σ^{2}}{n})

By standardising it,

\frac{\bar{X} - μ}{\frac{σ}{\sqrt{n}}} \sim N (0, 1)

Similarly, for the sum $S_n=X_1+X_2+...+X_n$ ,

\begin{matrix} S_{n} \sim N (n μ, n σ^{2}) \\ \frac{S_{n} - n μ}{\sqrt{n} σ} \sim N (0, 1) \end{matrix}

The more observations there are, the closer the distribution is to normal. Also, a smaller standard deviation means the estimate varies less and is more accurate.
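A simulation makes the CLT easier to believe. The sketch below is only an illustration under my own assumptions: the underlying r.v. is Exponential(1) (clearly not normal, with $\mu=1$ and $\sigma=1$), each sample has $n=50$ observations, and the experiment is repeated 5000 times with a fixed seed.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable (arbitrary choice)

n = 50                   # observations per sample
replications = 5000      # number of repeated samples
mu, sigma = 1.0, 1.0     # mean and sd of an Exponential(1) r.v.

# One sample mean per replication.
sample_means = [
    statistics.fmean(rng.expovariate(1.0) for _ in range(n))
    for _ in range(replications)
]

mean_of_means = statistics.fmean(sample_means)    # should be near mu
var_of_means = statistics.pvariance(sample_means)  # should be near sigma^2 / n

# Standardise each sample mean and count how many fall within +/- 1.96,
# which would be about 95% if the sample mean were exactly normal.
standardised = [(m - mu) / (sigma / math.sqrt(n)) for m in sample_means]
share_within = sum(1 for z in standardised if abs(z) < 1.96) / replications

print(round(mean_of_means, 3), round(var_of_means, 4), round(share_within, 3))
```

Even though the exponential dist is heavily skewed, the standardised sample means behave roughly like $N(0,1)$ at this sample size.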

Why is CLT important?

It is important because it provides a way to use repeated observations to estimate the whole population, which is impossible to observe in full.

6. A Few Notations

Recall, our aim in using statistics is to learn about the true population. We may assume the true population follows a distribution, and that distribution has some parameters. What we are doing now is using the sample data (which is feasible to collect) to infer the population parameters.

Estimator: a function/rule applied to the sample data, for example $\bar{x}$ and $S^2$ .
Estimate: the value/figure we actually calculate. By feeding data into an estimator, the output is the estimate.

Population (Population Parameters that we want to get but can never get)

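Population mean: $\mu=\frac{\sum x_i}{N}$ , where $N$ is the population size.

Population variance: $\sigma^2=\frac{\sum (x_i-\mu)^2}{N}$ .

Sample Estimator

Sample mean: $\bar{x}=\frac{\sum x_i}{n}$ , where $n$ is the sample size.

Sample variance (divided by $n$): $\hat{\sigma}^2=\frac{\sum (x_i-\bar{x})^2}{n}$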

Feeding data into the sample estimators gives the estimates, and those estimates are then used to infer the population parameters.

Remember that the sample is only part of the population; we collect data from the sample because it is more accessible and feasible to get. Still, we need our sample data to be representative of the population, or in other words, to give us some foresight about the whole population. Therefore, we use a different notation for sample statistics.

An important aspect is that we need our sample to represent the population well. There are some measures of this.

6.1 Unbiasedness

If $E(\bar{X})=\mu, \text{or} \ E(S^2)=\sigma^2$ (the expectation of our sample estimator is equal to the population parameter), then we would say the estimator is unbiased.

Define the sample variance as $S^2=\frac{\sum (x_i-\bar{x})^2}{n-1}$ . Then

E (S^{2}) = σ^{2}

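Why is the denominator "n-1"?

We do not know $\mu$, so we use $\bar{x}$ in its place. The deviations are measured from $\bar{x}$, and $\bar{x}$ is not intrinsically available: it is itself computed from the same sample, which uses up one degree of freedom (it is a cost, and to pay for that cost the denominator has a deduction).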

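$S^2$ is then an unbiased estimator of $\sigma^2$ . We also have a special name for the standard deviation of an estimator when it is estimated from the sample: the Standard Error, s.e. (for the sample mean, $S/\sqrt{n}$).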
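The effect of the "n-1" denominator shows up clearly in a simulation. This is a sketch under assumed settings: the population is $N(0, 2^2)$ so the true variance is 4, each sample has only 5 observations, and the seed and number of replications are arbitrary.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable (arbitrary choice)

sigma2 = 4.0            # true population variance of N(0, 2^2)
n = 5                   # small sample size, so the bias is easy to see
replications = 10_000

unbiased, biased = [], []
for _ in range(replications):
    sample = [rng.gauss(0.0, 2.0) for _ in range(n)]
    unbiased.append(statistics.variance(sample))   # divides by n - 1
    biased.append(statistics.pvariance(sample))    # divides by n

avg_unbiased = statistics.fmean(unbiased)  # should be near sigma2 = 4
avg_biased = statistics.fmean(biased)      # should be near (n-1)/n * sigma2 = 3.2

print(round(avg_unbiased, 2), round(avg_biased, 2))
```

On average the version divided by $n$ comes out too small, while the version divided by $n-1$ is centred on the true variance.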

6.2 Consistency

If, as $n\rightarrow \infty$ , the estimator gets close to the population parameter, we may say that the estimator is consistent.

Although $\hat{\sigma}^2=\frac{\sum (x_i-\bar{x})^2}{n}$ is biased, it is consistent as the number of observations keeps increasing.
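A short derivation of why a biased variance estimator can still be consistent. This is a sketch under my own assumptions: $\hat{\sigma}^2$ here means the sample variance divided by $n$ and computed around $\bar{x}$, $S^2$ is the version divided by $n-1$ with $E(S^2)=\sigma^2$, and the variance formula on the last line assumes normally distributed data.

```latex
\begin{aligned}
\hat{\sigma}^2 &= \frac{\sum (x_i - \bar{x})^2}{n} = \frac{n-1}{n} S^2 \\
E(\hat{\sigma}^2) &= \frac{n-1}{n} E(S^2) = \frac{n-1}{n} \sigma^2 \\
\text{Bias} &= E(\hat{\sigma}^2) - \sigma^2 = -\frac{\sigma^2}{n} \rightarrow 0 \quad \text{as } n \rightarrow \infty \\
Var(\hat{\sigma}^2) &= \frac{2 (n-1) \sigma^4}{n^2} \rightarrow 0 \quad \text{as } n \rightarrow \infty \ \text{(normal data)}
\end{aligned}
```

Both the bias and the variance shrink to zero, so the estimator concentrates around $\sigma^2$ as the sample grows, even though it is biased for every finite $n$.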

The discussion in this section has flaws and is awaiting an update.

7. Estimation

7.1 Maximum Likelihood Estimation - MLE

We assume a distribution for $X$ , fit it to the sample observations, and try to find the parameters that maximise the joint probability (the likelihood function).

That is, we look for the $\lambda$ that maximises the likelihood function.

λ_{0} = \arg \max_{λ} L (λ; x)

$\lambda_0$ is our MLE. (Remember what an estimator is? See section 6).

For example

Assume $X\sim N(\mu,\sigma^2)$ , and that $x_1,x_2,...,x_n$ are i.i.d. observations. We want to estimate $\mu$ and $\sigma^2$ . So, we need to maximise the log-likelihood function (instead of using the likelihood function, we apply a logarithm transformation for easier calculation. Because the log transformation is monotonic, it does not change where the maximum is, so the transformation is legal).

\begin{aligned} l (μ, σ^{2}; x_{1}, x_{2}, . . ., x_{n}) = f (x_{1}, x_{2}, . . ., x_{n}; μ, σ^{2}) & = f (x_{1}; μ, σ^{2}) f (x_{2}; μ, σ^{2}) . . . f (x_{n}; μ, σ^{2}) \\ \text{Let} \\ L (μ, σ^{2}; x_{1}, x_{2}, . . ., x_{n}) & = \log l (μ, σ^{2}; x_{1}, x_{2}, . . ., x_{n}) \\ & = \log f (x_{1}; μ, σ^{2}) + \log f (x_{2}; μ, σ^{2}) + . . . + \log f (x_{n}; μ, σ^{2}) \\ & = \sum_{i = 1}^{n} \log f (x_{i}; μ, σ^{2}) \\ \text{Plug in } f (x; μ, σ^{2}) & = \frac{1}{\sqrt{2 π σ^{2}}} \exp \left( - \frac{(x - μ)^{2}}{2 σ^{2}} \right) \\ L (μ, σ^{2}; x_{1}, . . ., x_{n}) & = \sum_{i = 1}^{n} \log \left[ \frac{1}{\sqrt{2 π σ^{2}}} \exp \left( - \frac{(x_{i} - μ)^{2}}{2 σ^{2}} \right) \right] \\ & = - \frac{n}{2} \log (2 π) - n \cdot \log (σ) - \frac{1}{2 σ^{2}} \sum_{i = 1}^{n} (x_{i} - μ)^{2} \end{aligned}

$F.O.C.$

\begin{matrix} {\hat{μ}}_{MLE} = \frac{1}{n} \sum x_{i} \\ {\hat{σ}^{2}}_{MLE} = \frac{1}{n} \sum (x_{i} - {\hat{μ}}_{MLE})^{2} \end{matrix}
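As a sanity check on the closed-form MLE for the normal dist, here is a small sketch. The five data points are made up, and `log_likelihood` is a hypothetical helper that implements the normal log-likelihood written in terms of $\sigma^2$. It compares the log-likelihood at the closed-form estimates with a few nearby parameter values.

```python
import math

def log_likelihood(mu, sigma2, xs):
    """Normal log-likelihood: -n/2 log(2 pi) - n/2 log(sigma^2) - sum((x - mu)^2) / (2 sigma^2)."""
    n = len(xs)
    return (-0.5 * n * math.log(2 * math.pi)
            - 0.5 * n * math.log(sigma2)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma2))

xs = [2.1, 1.9, 3.2, 2.8, 2.5]  # made-up observations (illustrative assumption)
n = len(xs)

# Closed-form MLEs from the first-order conditions.
mu_hat = sum(xs) / n
sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n

best = log_likelihood(mu_hat, sigma2_hat, xs)

# Log-likelihood at a few perturbed parameter values; none should beat `best`.
nearby = [
    log_likelihood(mu_hat + d_mu, sigma2_hat * scale, xs)
    for d_mu in (-0.2, 0.0, 0.2)
    for scale in (0.8, 1.0, 1.25)
]

print(round(mu_hat, 4), round(sigma2_hat, 4), best >= max(nearby))
```

This is not a proof, only a spot check that the closed-form values sit at a local peak of the log-likelihood.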

We will find that, for a linear model with normally distributed errors, the MLE of the coefficients is the same as the OLS estimator in the following section.

7.2 Regression

Assume a linear model, and choose the coefficients that give the minimum sum of squared errors.

\begin{matrix} {\hat{β}}_{all} = \arg \min_{β_{all}} \sum (y_{i} - \hat{y_{i}})^{2} \\ \Leftrightarrow \\ \hat{β} = \arg \min_{β} (Y - \hat{Y})^{'} (Y - \hat{Y}) \end{matrix}

, where

\begin{matrix} \hat{y} = \hat{β_{0}} + \hat{β_{1}} x_{1} + . . . + \hat{β_{k}} x_{k} \\ \text{or, } \hat{Y} = X \hat{β} \end{matrix}

$F.O.C.$

\hat{β} = (X^{'} X)^{- 1} X^{'} Y
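The formula $\hat{β} = (X'X)^{-1}X'Y$ can be worked through by hand for a simple regression with one regressor plus an intercept. This is a sketch with made-up data and plain Python (no linear algebra library), so $X'X$ is just a 2x2 matrix that we invert directly.

```python
# Made-up data for a simple regression y = b0 + b1 * x (illustrative assumption).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
n = len(xs)

# With X = [1, x] as columns, X'X = [[n, sum x], [sum x, sum x^2]]
# and X'Y = [sum y, sum x*y].
sx = sum(xs)
sxx = sum(x * x for x in xs)
sy = sum(ys)
sxy = sum(x * y for x, y in zip(xs, ys))

# Inverse of the 2x2 matrix X'X.
det = n * sxx - sx * sx
inv = [[sxx / det, -sx / det],
       [-sx / det, n / det]]

# beta_hat = (X'X)^{-1} X'Y
beta0 = inv[0][0] * sy + inv[0][1] * sxy
beta1 = inv[1][0] * sy + inv[1][1] * sxy

# First-order conditions: the residuals are orthogonal to every column of X.
residuals = [y - (beta0 + beta1 * x) for x, y in zip(xs, ys)]
foc_intercept = sum(residuals)
foc_slope = sum(x * e for x, e in zip(xs, residuals))

print(round(beta0, 4), round(beta1, 4))
```

The two first-order-condition sums come out numerically zero, which is exactly the condition $X'(Y - X\hat{β}) = 0$ that the closed-form solution is derived from.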

By Fanyu Zhao