- Statistical inference is the process of using sample data to draw conclusions about characteristics of populations, including hypothesis testing.
- Null hypothesis (H0) – assumes no effect in the population
- Alternative hypothesis (H1) – states there is an effect in the population; accepted if the null hypothesis is rejected
- Alternative hypotheses can be:
- One sided – state the direction of effect
- Two sided – do not state the direction of effect
- Hypothesis testing
- Hypothesis testing is the process of using a sample to decide whether to reject a hypothesis about a population
- To test a hypothesis:
- Determine alternative hypothesis (H1/A) & state null hypothesis (H0/null)
- Decide on statistical analysis testing method
- Collect data
- Statistical analysis testing
- Draw conclusions based on the results of the statistical analysis – reject, or fail to reject, the null hypothesis
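The five steps above can be sketched in Python; the data and the choice of a two-sample t-test are purely illustrative:

```python
# Hypothetical sketch of the hypothesis-testing steps, using made-up data.
from scipy import stats

# Step 1: H1 = the two groups differ in mean; H0 = no difference.
# Step 2: chosen test = independent two-sample t-test.
# Step 3: collect data (fabricated here purely for illustration).
group_a = [4.8, 5.1, 4.9, 5.3, 5.0, 4.7, 5.2, 4.9, 5.1, 5.0]
group_b = [6.9, 7.2, 7.0, 6.8, 7.1, 7.3, 6.9, 7.0, 7.2, 6.8]

# Step 4: run the statistical test.
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Step 5: draw a conclusion at the conventional 5% significance level.
alpha = 0.05
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject the null hypothesis")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject the null hypothesis")
```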
- P value
- The definition of the p value is the probability, assuming the null hypothesis is true, of obtaining test results at least as extreme as the results actually observed.
- P <0.05 (the usual threshold for statistical significance) means that, if the null hypothesis were true, there would be less than a 5% chance of obtaining results this extreme. This 5% is the accepted type 1 error rate, and the null hypothesis can be rejected.
- 95% confidence interval (CI)
- If the CI for a difference crosses 0 (the value of no effect; for ratio measures such as relative risk, the equivalent value is 1), the p value will not be <0.05 & so the null hypothesis cannot be rejected.
- CI gives information about – effect size & precision
(because CI is related to standard error – remember that: 95% Confidence Interval = mean +/- 1.96 x standard error)
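A minimal sketch of this formula, using made-up data:

```python
# 95% CI from the formula above: mean +/- 1.96 x standard error.
# The data are fabricated purely for illustration.
import math

data = [5.1, 4.8, 5.3, 5.0, 4.9, 5.2, 5.1, 4.7, 5.0, 4.9]
n = len(data)
mean = sum(data) / n
# sample standard deviation (n - 1 in the denominator)
sd = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
se = sd / math.sqrt(n)  # standard error of the mean
ci_lower = mean - 1.96 * se
ci_upper = mean + 1.96 * se
print(f"mean = {mean:.2f}, 95% CI = ({ci_lower:.2f}, {ci_upper:.2f})")
```

A narrower interval (smaller standard error, e.g. from a larger sample) indicates a more precise estimate.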
- Confidence intervals and hypothesis testing give you valuable information…
- Hypothesis testing results in a p value to indicate whether a conclusion can be made about rejecting the null hypothesis
- CIs can quantify the effect of interest & indicate its precision
- A p value without a summary statistic and CI gives you no idea of the actual effect
- Statistical analysis testing – parametric or non parametric testing
- Parametric test – a test that makes distributional assumptions about the data (usually that the data are normally distributed)
- Non parametric test – a test that makes no distributional assumptions about the data (these have less power to detect a real effect)
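As a sketch, the same (made-up) data can be analysed both ways – SciPy's `ttest_ind` (parametric) versus `mannwhitneyu` (non-parametric):

```python
# Illustrative comparison of a parametric and a non-parametric test on
# the same fabricated data: the t-test assumes normality, the
# Mann-Whitney U test does not.
from scipy import stats

group_a = [12, 14, 11, 13, 15, 12, 14, 13]
group_b = [18, 17, 19, 16, 20, 18, 17, 19]

t_stat, p_param = stats.ttest_ind(group_a, group_b)        # parametric
u_stat, p_nonparam = stats.mannwhitneyu(group_a, group_b)  # non-parametric

print(f"t-test p = {p_param:.4f}, Mann-Whitney p = {p_nonparam:.4f}")
```

With data this clearly separated both tests reject the null, but on skewed or ordinal data the non-parametric test is the safer choice, at the cost of some power.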
- Statistical vs clinical significance
- Findings may be statistically but not clinically significant and vice versa.
- For example, the p value from a trial of a novel Covid drug might be <0.05, with the primary outcome that the drug decreased time to recovery from viral infection by 1 day – a statistically significant result. But whether it is clinically significant is a different question: does this matter? Is a 1-day gain of clinical importance? Also make sure you've asked the right question: if your question is whether the drug decreases the chance of dying or being hospitalised due to the viral infection, then this hypothesis test did nothing to answer it.
Errors in hypothesis testing
|Reality \ Result of statistical test|No difference detected between groups (accept null hypothesis)|Difference detected between groups (reject null hypothesis)|
|---|---|---|
|No difference between groups (null hypothesis is correct)|No error – correct acceptance of null hypothesis|Type 1 error|
|Difference between groups (null hypothesis is incorrect)|Type 2 error|No error – correct rejection of null hypothesis|
- Type 1 error (false positive rate/significance level/α)
- Wrongly rejecting null hypothesis = assuming there is a difference when there actually is not
- You can reject null hypothesis if p<α
- α is typically set at 5% = <0.05 – hence you reject null hypothesis if p<0.05
- Type 2 error (false negative rate/β)
- Wrongly not rejecting null hypothesis = assuming there is no difference when there actually is
- Power is the complement of the type 2 error rate = the probability of detecting a difference if there really is one
- Power = 1 – β (i.e. if power is 0.8/80% – type 2 error risk is 0.2 or 20%)
- This is set up front at the design stage
- Factors that influence power (for a given test):
- Sample size – ↑ sample size = ↑ power
- Variability of observations – ↓ variability = ↑ power
- Effect of interest – ↑ effect size = ↑ power
- Significance level – ↑ α ↓ β = ↑ power
- Wide confidence intervals suggest poor power
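The effect of sample size on power can be illustrated with statsmodels' power calculator for a two-sample t-test (the effect size of 0.5 is an assumed Cohen's d, chosen purely for illustration):

```python
# Sketch: power rises with sample size for a fixed effect size and alpha.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
effect_size = 0.5   # assumed medium effect (Cohen's d) - illustrative only
alpha = 0.05

for n in (20, 50, 100):
    power = analysis.power(effect_size=effect_size, nobs1=n, alpha=alpha)
    print(f"n = {n} per group -> power = {power:.2f}")

# Solve for the sample size needed to reach 80% power at the design stage:
n_needed = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"n per group for 80% power: {n_needed:.0f}")
```

This is the calculation done "up front at the design stage": fix α, the smallest effect worth detecting, and the desired power, then solve for the required sample size.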
The multiple comparison problem
- Type 1 error rate (false positive rate) increases dramatically as number of comparisons increases
- Situations which involve multiple comparisons include:
- Subgroup analyses – splitting participant data into subgroups to make comparisons between them
- Multiple comparisons for a single outcome variable – when there are ≥3 treatment groups to analyse [1 vs 2, 1 vs 3, 2 vs 3] or ≥3 time points when response has been assessed
- Multiple outcome variables – when several different endpoints are evaluated to determine a treatment effect
- Interim analyses – when comparisons are made at a predetermined intermediate stage of a study
- Data dredging – making comparisons to look for relationships in data, when there has been no specification to this specific relationship before the study started
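The inflation described above can be shown directly: for m independent comparisons each at significance level α, the chance of at least one false positive (the family-wise error rate) is 1 − (1 − α)^m:

```python
# Family-wise error rate grows quickly with the number of independent
# comparisons: FWER = 1 - (1 - alpha)^m.
alpha = 0.05
for m in (1, 5, 10, 20):
    fwer = 1 - (1 - alpha) ** m
    print(f"{m:2d} comparisons -> chance of >=1 false positive = {fwer:.1%}")
```

At 20 comparisons the chance of at least one spurious "significant" finding is about 64%, even though each individual test keeps α at 5%.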
- Options or solutions to this:
- Ignore it
- Statistical controls for type 1 error:
- Bonferroni correction – α/significance level divided by the number of hypotheses tested
- The more hypotheses you test on a sample of data, the larger the chance of finding a rare event and the greater the risk of α error. For example, say you want to do a regression analysis of factors associated with passing FRCR1. You might have 20 variables to test, including medical school, training programme, age etc., so you are testing 20 hypotheses. Neglecting the Bonferroni correction (as is often done) results in an inflated chance of finding an association that does not truly exist (an α error).
- Benjamini–Hochberg procedure
- Sidak correction
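All three corrections are available in statsmodels; the p values below are made up purely for illustration:

```python
# Sketch: applying the corrections above to a set of fabricated p values.
from statsmodels.stats.multitest import multipletests

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]

for method in ("bonferroni", "fdr_bh", "sidak"):
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(f"{method}: {reject.sum()} of {len(pvals)} hypotheses rejected")
```

Note how the corrections differ in strictness: Bonferroni and Šidák control the family-wise error rate and reject the fewest hypotheses, while Benjamini–Hochberg controls the (less strict) false discovery rate and rejects more.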