12 Mean Inference

This chapter covers population mean inference (estimating the population mean). Each section below covers mean inference under a different set of assumptions and conditions.

The sections are:

  • Mean Inference (n.1.f.k): normal, iid, univariate, frequentist, known variance
  • Mean Inference (n.1.f): normal, iid, univariate, frequentist, unknown variance
  • Mean Inference (g.1.f): general, iid, univariate, frequentist
  • Mean Inference (n.2+.f): normal, iid, multivariate, frequentist
  • Mean Inference (g.2+.f): general, iid, multivariate, frequentist
  • Mean Inference (n.1.b.k): normal, iid, univariate, Bayesian, known variance
  • Mean Inference (n.1.b): normal, iid, univariate, Bayesian, unknown variance
  • Mean Inference (g.1.b): general, iid, univariate, Bayesian
  • Mean Inference (n.2+.b): normal, iid, multivariate, Bayesian
  • Mean Inference (g.2+.b): general, iid, multivariate, Bayesian

12.1 Normal-Univariate-Frequentist-Known-Variance

This section covers population mean inference when the population distribution is assumed to be normal and population variance is known ex ante. More specifically, mean inference is performed under the following assumptions and conditions:

Assumptions:

  • Population follows a normal distribution
  • Observations are iid
  • Known population variance ex ante

Other Conditions, Criteria, or Attributes:

  • Univariate mean inference
  • Frequentist perspective

12.1.1 Overview and Summary

Under the assumptions above, inference can be done using the Z-statistic and Z-distribution (standard normal distribution).


Key Takeaways:

  • We calculate the Z-statistic and use the Z-values and Z-distribution (standard normal distribution) to create the confidence interval and perform hypothesis testing since the population variance is known. If the population variance is unknown, we would use the T-statistic and T-distribution.
  • You can think of the test statistic as a way to relate the sample data, our hypothesized population parameter, and a likelihood: plugging Z into the standard normal distribution tells us how likely a deviation that large is if the hypothesized value is correct.

12.1.2 Point Estimation

Most commonly we use the following formula to generate an unbiased estimate of the population mean. The formula is:

\[\bar{x} = \frac{\sum^n_{i=1}{x_{i}}}{n}\]

Where:

  • \(\bar{x}\) = unbiased estimate of population mean
  • \(n\) = sample size
  • \(x_{i}\) = your sample observations

The formula above not only produces an unbiased estimate under the assumptions in this section, but always produces an unbiased estimate of the population mean no matter the distribution (as long as the population mean exists).
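
In R this is just the sample average; a minimal check (the values of x here are made up purely for illustration):

    x <- c(4.2, 5.1, 3.8, 4.9, 5.4)   # any numeric vector of observations (made-up values)
    sum(x) / length(x)                 # the formula above
    mean(x)                            # identical result using the built-in function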


Key Takeaways:

  • The above formula will produce an unbiased estimate no matter what the population distribution is.
  • While you could use maximum likelihood estimation, method of moments, or other estimation techniques to produce a point estimate for the population mean, in practice we just use the formula above. Under the normal assumption, both maximum likelihood estimation and method of moments will result in the same estimate of the mean as the formula above.

12.1.3 Confidence Interval

The formula for the confidence interval given our assumptions is:

\[\bar{x} \pm z_{\frac{\alpha}{2}}(\frac{\sigma}{\sqrt{n}})\]

Where:

  • \(\bar{x}\) is the sample mean
  • \(\alpha\) is the significance level
  • \(\sigma\) is the population standard deviation, which we know ex ante
  • \(n\) is the sample size

As with all confidence intervals, we first choose a significance level \(\alpha\) (the Type I error rate), then build the interval. The confidence interval gives us a range around our sample estimate that captures the true population parameter with probability \((1 - \alpha)\).
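
As a quick hypothetical example, suppose \(\bar{x} = 10\), \(\sigma = 2\), \(n = 25\), and \(\alpha = 0.05\). Then \(z_{0.025} \approx 1.96\) and the interval is

\[10 \pm 1.96 \left(\frac{2}{\sqrt{25}}\right) = 10 \pm 0.784,\]

i.e. roughly \((9.22,\ 10.78)\).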


Key Takeaways:

  • We build the confidence interval using the Z-distribution as opposed to T-distribution when the population is normal and variance is known
  • For a given level of alpha, the interval built using the Z-distribution will be narrower than one built using the T-distribution
  • The expression for confidence interval uses a rearrangement of the formula for the \(Z\) statistic. The \(Z\) statistic relates \(\bar{x}\) with our hypothesized value for the population mean, \(\mu\).
  • When it comes to interpretation, from a frequentist standpoint, the confidence interval does not give the probability that the population mean falls in a fixed range. Instead, \((1 - \alpha)\) is the probability that an interval constructed this way captures the population parameter. That's because what is random is our sample (and hence the interval), not the population parameter.

12.1.4 Hypothesis Testing

The formula for the test statistic, in this case the Z-statistic, used for hypothesis testing is:

\[Z = \frac{\bar{x} - \mu}{\frac{\sigma}{\sqrt{n}}}\]

Where:

  • \(\bar{x}\) is the sample mean
  • \(\mu\) is the hypothesized population mean (the value under the null hypothesis)
  • \(\sigma\) is the population standard deviation, which we know ex ante
  • \(n\) is the sample size

Under the null hypothesis, Z follows a normal distribution with mean 0 and variance 1 (the standard normal distribution). For hypothesis testing, we either use Z to compute a p-value or check whether Z exceeds a critical value.
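
Continuing the hypothetical numbers from the confidence interval example (\(\bar{x} = 10\), \(\sigma = 2\), \(n = 25\)), a two-sided test of \(H_0: \mu = 9.5\) gives

\[Z = \frac{10 - 9.5}{2 / \sqrt{25}} = \frac{0.5}{0.4} = 1.25,\]

with p-value \(2(1 - \Phi(1.25)) \approx 0.21\), so at \(\alpha = 0.05\) we would fail to reject \(H_0\).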


Key Takeaways:

  • In the points below, “large” Z-statistics and Z-values means large in absolute value.
  • A large Z-statistic means that the distance between our sample mean and the hypothesized population mean is large relative to the standard error.
  • The p-value is the probability, assuming the null hypothesis is true, of seeing a Z-value at least as large as the observed Z-statistic. Large Z-statistics lead to small p-values. Under the p-value method, a small p-value suggests rejecting the null hypothesis: if the null were true, a sample mean at least this far from the hypothesized value would be very unlikely. We reject the null hypothesis if the p-value is less than \(\alpha\); a common convention is \(\alpha = 0.05\).
  • Under the critical value method, a large Z-statistic suggests rejecting the null hypothesis: if the hypothesized value were correct, a deviation this large between it and our sample mean would be unlikely, so the hypothesized value is probably not correct. We reject the null hypothesis if the test statistic is larger in absolute value than the critical value. As a rule of thumb, a Z-statistic larger than 2 in absolute value suggests rejection at \(\alpha = 0.05\) (since \(z_{0.025} \approx 1.96\)).

12.1.5 Suggested Steps

Here are some suggested steps for performing mean inference in practice.

Point Estimation:

  • Make sure you know the population variance ex ante. This is unrealistic in practice so we generally do not use this inference method.
  • Obtain sample data and ensure it is iid. Observations from a simple random sample can generally be treated as iid.
  • Check that the sample data fits the normal assumption using methods such as a QQ plot. If it is not normal, you should not use this method.
  • Calculate the sample mean

Confidence Interval:

  • Continuing from above…
  • Choose an \(\alpha\)
  • Construct the confidence interval
  • The confidence interval is a range that has a \((1 - \alpha)\) probability of capturing the true population mean

Hypothesis Testing:

  • Continuing from above…
  • Set up your \(H_{0}\) and \(H_{a}\) statements
  • Calculate the Z-statistic
  • Using either statistical software or the Z-Table, use either the p-value or critical value method to determine whether or not to reject your null hypothesis
  • Interpret your results

12.1.6 Doing It In R

Coming soon
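
In the meantime, here is a minimal sketch in base R. There is no built-in Z-test function in the stats package, but the calculations are short; the simulated data, the “known” \(\sigma = 2\), and the hypothesized \(\mu_0 = 9.5\) below are all arbitrary choices for illustration.

    set.seed(1)                                 # arbitrary seed
    sigma <- 2                                  # population sd, assumed known ex ante
    x     <- rnorm(25, mean = 10, sd = sigma)   # simulated iid normal sample
    n     <- length(x)
    x_bar <- mean(x)                            # point estimate of the population mean

    alpha <- 0.05
    z     <- qnorm(1 - alpha / 2)               # ~1.96
    se    <- sigma / sqrt(n)
    ci    <- c(x_bar - z * se, x_bar + z * se)  # (1 - alpha) confidence interval

    mu_0    <- 9.5                              # hypothesized mean under H0
    z_stat  <- (x_bar - mu_0) / se              # Z-statistic
    p_value <- 2 * pnorm(-abs(z_stat))          # two-sided p-value

    ci; z_stat; p_value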

12.1.7 Sources and Useful Links

Sources:

  • DeGroot, M. H. & Schervish, M. J., Probability and Statistics (4th ed.)

Useful Links:

12.2 Normal-Univariate-Frequentist-Unknown-Variance

This section covers population mean inference when the population distribution is assumed to be normal and population variance is unknown. More specifically, mean inference is performed under the following assumptions and conditions:

Assumptions:

  • Population follows a normal distribution
  • Observations are iid
  • Unknown population variance

Other Conditions, Criteria, or Attributes:

  • Univariate mean inference
  • Frequentist perspective

12.2.1 Overview and Summary

Under the assumptions above, inference can be done using the T-statistic and T-distribution.


Key Takeaways:

  • We calculate the T-statistic and use the T-values and T-distribution to create the confidence interval and perform hypothesis testing since the population variance is unknown. If the population variance is known, we can use the Z-statistic and Z-distribution.
  • If the sample size is greater than 30, our results from using the T-statistic and T-distribution will be very similar to results from using the Z-statistic and Z-distribution
  • You can think of the test statistic as a way to relate the sample data, our hypothesized population parameter, and a likelihood: plugging T into the T-distribution tells us how likely a deviation that large is if the hypothesized value is correct.

12.2.2 Point Estimation

Most commonly we use the following formula to generate an unbiased estimate of the population mean. The formula is:

\[\bar{x} = \frac{\sum^n_{i=1}{x_{i}}}{n}\]

Where:

  • \(\bar{x}\) = unbiased estimate of population mean
  • \(n\) = sample size
  • \(x_{i}\) = your sample observations

The formula above not only produces an unbiased estimate under the assumptions in this section, but always produces an unbiased estimate of the population mean no matter the distribution (as long as the population mean exists).


Key Takeaways:

  • The above formula will produce an unbiased estimate no matter what the population distribution is.
  • While you could use maximum likelihood estimation, method of moments, or other estimation techniques to produce a point estimate for the population mean, in practice we just use the formula above. Under the normal assumption, both maximum likelihood estimation and method of moments will result in the same estimate of the mean as the formula above.

12.2.3 Confidence Interval

The formula for the confidence interval given our assumptions is:

\[\bar{x} \pm t_{\frac{\alpha}{2}, n-1}(\frac{s}{\sqrt{n}})\]

Where:

  • \(\bar{x}\) is the sample mean
  • \(\alpha\) is the significance level
  • \(s\) is the sample standard deviation using the corrected sample standard deviation formula
  • \(n\) is the sample size
  • \(t\) comes from a t-distribution with \(n - 1\) degrees of freedom

As with all confidence intervals, we first choose a significance level \(\alpha\) (the Type I error rate), then build the interval. The confidence interval gives us a range around our sample estimate that captures the true population parameter with probability \((1 - \alpha)\).
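
As a hypothetical example, with \(\bar{x} = 10\), \(s = 2\), \(n = 30\), and \(\alpha = 0.05\), the critical value is \(t_{0.025,\,29} \approx 2.045\) (versus \(z_{0.025} \approx 1.96\)), so the interval is

\[10 \pm 2.045 \left(\frac{2}{\sqrt{30}}\right) \approx 10 \pm 0.75,\]

slightly wider than the corresponding Z-based interval.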


Key Takeaways:

  • We build the confidence interval using the T-distribution as opposed to Z-distribution when the population is normal and variance is unknown
  • For a given level of alpha, the interval built using the T-distribution will be wider than one built using the Z-distribution
  • The expression for confidence interval uses a rearrangement of the formula for the \(T\) statistic. The \(T\) statistic relates \(\bar{x}\) with our hypothesized value for the population mean, \(\mu\).
  • When it comes to interpretation, from a frequentist standpoint, the confidence interval does not give the probability that the population mean falls in a fixed range. Instead, \((1 - \alpha)\) is the probability that an interval constructed this way captures the population parameter. That's because what is random is our sample (and hence the interval), not the population parameter.

12.2.4 Hypothesis Testing

The formula for the test statistic, in this case the T-statistic, used for hypothesis testing is:

\[T = \frac{\bar{x} - \mu}{\frac{S}{\sqrt{n}}}\]

Where:

  • \(\bar{x}\) = sample mean (unbiased estimate of the population mean)
  • \(\mu\) = hypothesized population mean (the value under the null hypothesis)
  • \(S\) = sample standard deviation using the corrected sample standard deviation formula
  • \(n\) = sample size

Under the null hypothesis, T follows a Student’s T-distribution with \(n-1\) degrees of freedom. For hypothesis testing, we either use T to compute a p-value or check whether T exceeds a critical value.
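
Continuing the hypothetical numbers from the confidence interval example (\(\bar{x} = 10\), \(s = 2\), \(n = 30\)), a two-sided test of \(H_0: \mu = 9.5\) gives

\[T = \frac{10 - 9.5}{2 / \sqrt{30}} \approx 1.37,\]

which is smaller in absolute value than \(t_{0.025,\,29} \approx 2.045\), so at \(\alpha = 0.05\) we would fail to reject \(H_0\).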


Key Takeaways:

  • In the points below, “large” T-statistics and T-values means large in absolute value.
  • A large T-statistic means that the distance between our sample mean and the hypothesized population mean is large relative to the estimated standard error.
  • The p-value is the probability, assuming the null hypothesis is true, of seeing a T-value at least as large as the observed T-statistic. Large T-statistics lead to small p-values. Under the p-value method, a small p-value suggests rejecting the null hypothesis: if the null were true, a sample mean at least this far from the hypothesized value would be very unlikely. We reject the null hypothesis if the p-value is less than \(\alpha\); a common convention is \(\alpha = 0.05\).
  • Under the critical value method, a large T-statistic suggests rejecting the null hypothesis: if the hypothesized value were correct, a deviation this large between it and our sample mean would be unlikely, so the hypothesized value is probably not correct. We reject the null hypothesis if the test statistic is larger in absolute value than the critical value. As a rule of thumb, with a large enough sample size, a T-statistic larger than 2 in absolute value suggests rejection at \(\alpha = 0.05\).

12.2.5 Suggested Steps

Here are some suggested steps for performing mean inference in practice.

Point Estimation:

  • Obtain sample data and ensure it is iid. Observations from a simple random sample can generally be treated as iid.
  • Check that the sample data fits the normal assumption using methods such as a QQ plot.
  • Calculate the sample mean

Confidence Interval:

  • Continuing from above…
  • Choose an \(\alpha\)
  • Construct the confidence interval
  • The confidence interval is a range that has a \((1 - \alpha)\) probability of capturing the true population mean

Hypothesis Testing:

  • Continuing from above…
  • Set up your \(H_{0}\) and \(H_{a}\) statements
  • Calculate the T-statistic
  • Using either statistical software or the T-Table, use either the p-value or critical value method to determine whether or not to reject your null hypothesis
  • Interpret your results

12.2.6 Doing It In R

We can do T-tests in base R using t.test(). In this example, I simulate our observations from a normal distribution; x in the code below is the vector of sample data.
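
A minimal sketch is below; the sample size, the simulation parameters (mean 10, standard deviation 2), and the hypothesized value of 9.5 are arbitrary choices for illustration.

    set.seed(1)                         # arbitrary seed
    x <- rnorm(30, mean = 10, sd = 2)   # simulated iid normal sample (placeholder parameters)

    mean(x)                             # point estimate of the population mean

    # Two-sided T-test of H0: mu = 9.5; the output also includes the 95% confidence interval
    t.test(x, mu = 9.5, conf.level = 0.95)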

12.3 General-Univariate-Frequentist

This section covers population mean inference when we do not make any prior assumptions about the distribution of the population. More specifically, mean inference is performed under the following assumptions and conditions:

Assumptions:

  • Population can follow any shaped distribution with finite mean (not necessarily normal)
  • Observations are iid
  • Sample size must be large enough or one should use a non-parametric method
  • Population variance can be known or unknown (and must be finite)

Other Conditions, Criteria, or Attributes:

  • Univariate mean inference
  • Frequentist perspective

12.3.1 Overview and Summary

Under the assumptions above, inference can be done using the T-statistic and T-distribution. Because the population is not assumed to be normal, this relies on the central limit theorem: for a large enough sample, the sample mean is approximately normally distributed.

Key Takeaways:

  • With the shape of the population distribution unknown, we will always use the T-statistic and T-distribution to create the confidence interval, regardless of whether we know the population variance.
  • When your sample size exceeds 30 or so (need citation), the results from a T-test will be very similar to the results from a Z-test. This is because the T-distribution converges to the Z-distribution (standard normal distribution) as the sample size increases; the sketch after this list illustrates the convergence.
  • The rule of thumb is 30, but in practice the required sample size depends on the skewness of the population distribution: the more skewed the population, the larger the sample we need (need citation). Be careful not to confuse this rule of thumb with the central limit theorem itself. With sample sizes much larger than 30 we could use the Z-statistic for hypothesis testing and confidence interval building, but it is usually safer to always use the T-statistic.
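
A quick sketch of this convergence in R, comparing T and Z critical values for a two-sided test at \(\alpha = 0.05\):

    qnorm(0.975)          # Z critical value, ~1.96
    qt(0.975, df = 9)     # n = 10,  ~2.26
    qt(0.975, df = 29)    # n = 30,  ~2.05
    qt(0.975, df = 299)   # n = 300, ~1.97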

12.3.2 Point Estimation

Most commonly we use the following formula to generate an unbiased estimate of the population mean. The formula is:

\[\bar{x} = \frac{\sum^n_{i=1}{x_{i}}}{n}\]

Where:

  • \(\bar{x}\) = unbiased estimate of population mean
  • \(n\) = sample size
  • \(x_{i}\) = your sample observations

The formula above not only produces an unbiased estimate under the assumptions in this section, but always produces an unbiased estimate of the population mean no matter the distribution (as long as the population mean exists).


Key Takeaways:

  • The above formula will produce an unbiased estimate no matter what the population distribution is.
  • While you could use maximum likelihood estimation, method of moments, or other estimation techniques to produce a point estimate for the population mean, in practice we just use the formula above. Under the normal assumption, both maximum likelihood estimation and method of moments will result in the same estimate of the mean as the formula above.

12.3.3 Confidence Interval

The formula for the confidence interval given our assumptions is:

\[\bar{x} \pm t_{\frac{\alpha}{2}, n-1}(\frac{s}{\sqrt{n}})\]

Where:

  • \(\bar{x}\) is the sample mean
  • \(\alpha\) is the significance level
  • \(s\) is the sample standard deviation using the corrected sample standard deviation formula
  • \(n\) is the sample size
  • \(t\) comes from a t-distribution with \(n - 1\) degrees of freedom

As with all confidence intervals, we first choose a significance level \(\alpha\) (the Type I error rate), then build the interval. The confidence interval gives us a range around our sample estimate that captures the true population parameter with probability \((1 - \alpha)\).


Key Takeaways:

  • We always use the T-distribution to build a confidence interval in this situation.
  • The expression for confidence interval uses a rearrangement of the formula for the \(T\) statistic. The \(T\) statistic relates \(\bar{x}\) with our hypothesized value for the population mean, \(\mu\).
  • When it comes to interpretation, from a frequentist standpoint, the confidence interval does not give the probability that the population mean falls in a fixed range. Instead, \((1 - \alpha)\) is the probability that an interval constructed this way captures the population parameter. That's because what is random is our sample (and hence the interval), not the population parameter.

12.3.4 Hypothesis Testing

The formula for the test statistic, in this case the T-statistic, used for hypothesis testing is:

\[T = \frac{\bar{x} - \mu}{\frac{S}{\sqrt{n}}}\]

Where:

  • \(\bar{x}\) = sample mean (unbiased estimate of the population mean)
  • \(\mu\) = hypothesized population mean (the value under the null hypothesis)
  • \(S\) = sample standard deviation using the corrected sample standard deviation formula
  • \(n\) = sample size

Under the null hypothesis, T follows a Student’s T-distribution with \(n-1\) degrees of freedom. For hypothesis testing, we either use T to compute a p-value or check whether T exceeds a critical value.


Key Takeaways:

  • In the points below, “large” T-statistics and T-values means large in absolute value.
  • A large T-statistic means that the distance between our sample mean and the hypothesized population mean is large relative to the estimated standard error.
  • The p-value is the probability, assuming the null hypothesis is true, of seeing a T-value at least as large as the observed T-statistic. Large T-statistics lead to small p-values. Under the p-value method, a small p-value suggests rejecting the null hypothesis: if the null were true, a sample mean at least this far from the hypothesized value would be very unlikely. We reject the null hypothesis if the p-value is less than \(\alpha\); a common convention is \(\alpha = 0.05\).
  • Under the critical value method, a large T-statistic suggests rejecting the null hypothesis: if the hypothesized value were correct, a deviation this large between it and our sample mean would be unlikely, so the hypothesized value is probably not correct. We reject the null hypothesis if the test statistic is larger in absolute value than the critical value. As a rule of thumb, with a large enough sample size, a T-statistic larger than 2 in absolute value suggests rejection at \(\alpha = 0.05\).

12.3.5 Suggested Steps

Here are some suggested steps for performing mean inference in practice.

Point Estimation:

  • Obtain sample data and ensure it is iid. Observations from a simple random sample can generally be treated as iid.
  • Check the sample data to see if it is skewed. More skewed data means you need a larger sample size.
  • Calculate the sample mean

Confidence Interval:

  • Continuing from above…
  • Choose an \(\alpha\)
  • Construct the confidence interval
  • The confidence interval is a range that has a \((1 - \alpha)\) probability of capturing the true population mean

Hypothesis Testing:

  • Continuing from above…
  • Set up your \(H_{0}\) and \(H_{a}\) statements
  • Calculate the T-statistic
  • Using either statistical software or the T-Table, use either the p-value or critical value method to determine whether or not to reject your null hypothesis
  • Interpret your results

12.3.6 Doing It In R

We can do T-tests in base R using t.test(). In this example, I simulate our observations from a beta distribution with shape parameters \(\alpha = 1\) and \(\beta = 5\), resulting in a skewed distribution. We know that the true population mean of such a distribution is \(\frac{\alpha}{\alpha + \beta} = \frac{1}{1 + 5} \approx 0.1667\). \(x\) in the code below is a vector of sample data from this beta distribution.

In the example below, we build a confidence interval with \(\alpha = 0.05\). We also do a two-sided T-test with \(\alpha = 0.05\) and our hypothesis testing statements as:

\[H_0: \mu = 0.25\] \[H_a: \mu \neq 0.25\]
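
The output below can be produced with code along the following lines; the sample size of 30 matches the 29 degrees of freedom in the output, but the seed is an arbitrary assumption, so an exact reproduction of these numbers is not guaranteed.

    set.seed(1)                              # arbitrary seed
    x <- rbeta(30, shape1 = 1, shape2 = 5)   # 30 iid draws from Beta(1, 5); true mean = 1/6

    # Two-sided T-test of H0: mu = 0.25, with 95% confidence interval
    t.test(x, mu = 0.25, conf.level = 0.95)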

|  
|  	One Sample t-test
|  
|  data:  x
|  t = -2.3251, df = 29, p-value = 0.02727
|  alternative hypothesis: true mean is not equal to 0.25
|  95 percent confidence interval:
|   0.1117072 0.2411425
|  sample estimates:
|  mean of x 
|  0.1764248

Confidence Interval:

The output shows our sample mean and confidence interval. Our sample mean is \(0.1764\) and the confidence interval ranges from \(0.1117\) to \(0.2411\). An interval constructed this way has a 95% chance of capturing the true population mean.

Hypothesis Testing: Using the critical value method, we see that our T-statistic is \(-2.3251\), which is larger in absolute value than the critical value of \(2.045\), so we reject the null hypothesis; it is unlikely that the true population mean is \(0.25\) given our sample mean of \(0.1764\). Using the p-value method, we see that our p-value is \(0.0273\), which is less than our \(\alpha\) of \(0.05\), so we also reject the null hypothesis.

12.3.7 Sources and Useful Links

Sources:

  • DeGroot, M. H. & Schervish, M. J., Probability and Statistics (4th ed.)

Useful Links: