[ML&DL] 7. Multiple Hypothesis Testing

KBC · December 14, 2024

machine learning

Machine Learning and Deep Learning

Multiple Hypothesis Testing

This session focuses on multiple hypothesis testing
A single null hypothesis might look like

$\mathcal{H}_0$ : the expected blood pressures of mice in the control and treatment groups are the same
We will now consider testing $m$ null hypotheses, $H_{01},\dots,H_{0m}$
where e.g.

$\mathcal{H}_{0j}$ : the expected values of the $j^{th}$ biomarker among mice in the control and treatment groups are equal
In this setting, we need to be careful to avoid incorrectly rejecting too many null hypotheses,
i.e. having too many false positives

A Quick Review of Hypothesis Testing

Hypothesis tests allow us to answer simple yes-or-no questions, such as
- Is the true coefficient $\beta_j$ in a linear regression equal to zero?
- Does the expected blood pressure among mice in the treatment group equal the expected blood pressure among mice in the control group?
Hypothesis testing proceeds as follows :
1. Define the null and alternative hypotheses
2. Construct the test statistic
3. Compute the $p$ -value
4. Decide whether to reject the null hypothesis

1. Define the Null and Alternative Hypotheses

We divide the world into null and alternative hypotheses
The null hypothesis $\mathcal{H}_0$ , is the default state of belief about the world. For instance :
1. The true coefficient $\beta_j$ equals zero
2. There is no difference in the expected blood pressure
The alternative hypothesis $\mathcal{H}_a$ , represents something different and unexpected. For instance :
1. The true coefficient $\beta_j$ is non-zero
2. There is a difference in the expected blood pressure

2. Construct the Test Statistic

The test statistic summarizes the extent to which our data are consistent with $\mathcal{H}_0$
Let $\hat \mu_t$ and $\hat \mu_c$ denote the average blood pressure for the $n_t$ mice in the treatment group and the $n_c$ mice in the control group, respectively
To test $\mathcal{H}_0$ : $\mu_t=\mu_c$ , we use a two-sample $t$ -statistic $T=\frac{\hat \mu_t-\hat \mu_c}{s\sqrt{\frac{1}{n_t}+\frac{1}{n_c}}}$ , where $s$ is the pooled standard deviation, $s=\sqrt{\frac{(n_t-1)s_t^2+(n_c-1)s_c^2}{n_t+n_c-2}}$ , and $s_t^2$ and $s_c^2$ are the sample variances of the two groups
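As a minimal sketch of the two-sample $t$ -statistic, assuming the usual pooled standard deviation, the computation might look like the following. The function name `two_sample_t` and the blood pressure readings are made up for illustration and are not from the lecture.

```python
import math
from statistics import mean, variance

def two_sample_t(treatment, control):
    """Pooled two-sample t-statistic for H0: mu_t == mu_c."""
    n_t, n_c = len(treatment), len(control)
    # statistics.variance is the sample variance (divides by n - 1)
    s_t2, s_c2 = variance(treatment), variance(control)
    # pooled standard deviation s
    s = math.sqrt(((n_t - 1) * s_t2 + (n_c - 1) * s_c2) / (n_t + n_c - 2))
    return (mean(treatment) - mean(control)) / (s * math.sqrt(1 / n_t + 1 / n_c))

# Hypothetical blood pressure readings, invented for this example
treatment = [118.0, 121.0, 125.0, 130.0, 127.0]
control = [112.0, 115.0, 119.0, 117.0, 121.0]
t_stat = two_sample_t(treatment, control)
print(round(t_stat, 2))
```

Swapping the two groups only flips the sign of the statistic, which is why the two-sided test looks at $|T|$ .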

3. Compute the p-value

The $p$ -value is the probability of observing a test statistic at least as extreme as the observed statistic, under the assumption that $\mathcal{H}_0$ is true
A small $p$ -value provides evidence against $\mathcal{H}_0$
Suppose we compute $T=2.33$ for our test of $\mathcal{H}_0:\mu_t=\mu_c$
Under $\mathcal{H}_0$ , a two-sample $t$ -statistic approximately follows a $\mathcal{N}(0,1)$ distribution when the sample sizes are large
The p-value is $0.02$ because, if $\mathcal{H}_0$ is true, we would only see $|T|$ this large $2\%$ of the time
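The $0.02$ figure can be checked numerically. The sketch below assumes the standard normal null distribution and uses the identity $\Pr(|Z|\geq t)=\text{erfc}(t/\sqrt{2})$ . The helper name `two_sided_p_value_normal` is hypothetical.

```python
import math

def two_sided_p_value_normal(t):
    """Two-sided p-value Pr(|Z| >= |t|) for Z ~ N(0, 1)."""
    # Pr(|Z| >= |t|) = erfc(|t| / sqrt(2))
    return math.erfc(abs(t) / math.sqrt(2))

p = two_sided_p_value_normal(2.33)
print(round(p, 3))  # 0.02
```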

4. Decide Whether to Reject Null Hypothesis, Part 1

A small $p$ -value indicates that such a large value of the test statistic is unlikely to occur under $\mathcal{H}_0$
So, a small $p$ -value provides evidence against $\mathcal{H}_0$
If the $p$ -value is sufficiently small, then we will want to reject $\mathcal{H}_0$
But how small is small enough? To answer this, we need to understand the Type 1 Error

4. Decide Whether to Reject Null Hypothesis, Part 2

A Type 1 Error occurs when we reject $\mathcal{H}_0$ even though it is true. The Type 1 Error rate is the probability of making a Type 1 Error
We want to ensure a small Type 1 Error rate
If we reject $\mathcal{H}_0$ when the p-value is less than $\alpha$ , then the Type 1 Error rate will be at most $\alpha$
So, we reject $\mathcal{H}_0$ when the p-value falls below some $\alpha$ : often we choose $\alpha$ to equal $0.05$ or $0.01$

Multiple Testing

Now suppose that we wish to test $m$ null hypotheses, $H_{01},\dots,H_{0m}$
Can we simply reject all null hypotheses for which the corresponding $p$ -value falls below $0.01?$
If we reject all null hypotheses for which the $p$ -value falls below $0.01$ , then how many Type 1 Errors will we make?

A Thought Experiment

Suppose that we flip a fair coin ten times, and we wish to test
$\mathcal{H}_0:\text{the coin is fair}$
- We'll probably get approximately the same number of heads and tails
- The $p$ -value probably won't be small. We do not reject $\mathcal{H}_0$
But what if we flip $1,024$ fair coins ten times each?
- We'd expect one coin (on average) to come up all tails
- The $p$ -value for the null hypothesis that this particular coin is fair is less than $0.002$ !
- So we would conclude it is not fair, i.e. we reject null hypothesis, even though it's a fair coin
If we test a lot of hypotheses, we are almost certain to get one very small $p$ -value by chance
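The numbers in this thought experiment follow from simple arithmetic. The sketch below assumes a two-sided test, so that all heads counts as just as extreme as all tails, and assumes the coins are independent. The variable names are made up for illustration.

```python
n_flips = 10
n_coins = 1024

# probability that one fair coin shows all tails in ten flips
p_all_tails = 0.5 ** n_flips
# two-sided p-value for an all-tails coin: all heads is equally extreme
p_value = 2 * p_all_tails
# expected number of all-tails coins among 1,024 fair coins
expected_all_tails = n_coins * p_all_tails
# chance that at least one coin is all heads or all tails (about 0.86)
prob_some_extreme = 1 - (1 - p_value) ** n_coins

print(p_value, expected_all_tails, round(prob_some_extreme, 2))
```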

The Challenge of Multiple Testing

Suppose we test $H_{01},\dots,H_{0m}$ , all of which are true, and reject any null hypothesis with a $p$ -value below $0.01$
Then we expect to falsely reject approximately $0.01\times m$ null hypotheses
If $m=10,000$ , then we expect to falsely reject $100$ null hypotheses by chance!

That's a lot of Type 1 Errors, i.e. false positives
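A quick simulation illustrates the count. It assumes that each $p$ -value is uniformly distributed on $[0,1]$ when its null hypothesis is true, which holds for continuous test statistics. The seed is arbitrary.

```python
import random

random.seed(0)  # arbitrary seed, only for reproducibility
m, alpha = 10_000, 0.01

# under a true null with a continuous test statistic, p ~ Uniform(0, 1)
p_values = [random.random() for _ in range(m)]
false_rejections = sum(p < alpha for p in p_values)

print(false_rejections)  # should land near alpha * m = 100
```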

The Family-Wise Error Rate

The family-wise error rate (FWER) is the probability of making at least one Type 1 error when conducting $m$ hypothesis tests
$\text{FWER}=\Pr(V\geq1)$ , where $V$ is the number of Type 1 errors, i.e. falsely rejected null hypotheses

Challenges in Controlling the FWER

$\text{FWER}=1-\Pr(\text{do not falsely reject any null hypotheses})$

If the tests are independent and all $H_{0j}$ are true then $\text{FWER} = 1-\prod^m_{j=1}(1-\alpha)=1-(1-\alpha)^m$
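To see how quickly this grows with $m$ , the formula can be evaluated directly. This sketch assumes independent tests and all null hypotheses true, as stated above. The function name `fwer_independent` is hypothetical.

```python
def fwer_independent(alpha, m):
    """FWER = 1 - (1 - alpha)^m for m independent tests of true nulls."""
    return 1 - (1 - alpha) ** m

for m in (1, 10, 100, 500):
    print(m, round(fwer_independent(0.05, m), 3))
```

Even at $m=10$ the FWER is already around $0.4$ when each test uses $\alpha=0.05$ .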

The Bonferroni Correction

$\text{FWER} = \Pr(\text{falsely reject at least one null hypothesis})\leq\sum^m_{j=1}\Pr(A_j)$

Where $A_j$ is the event that we falsely reject the $j$ th null hypothesis
If we only reject hypotheses when the $p$ -value is less than $\alpha/m$ , then $\text{FWER}\leq\sum^m_{j=1}\Pr(A_j)\leq\sum^m_{j=1}\frac{\alpha}{m}=m\times\frac{\alpha}{m}=\alpha$
This is the Bonferroni Correction : to control FWER at level $\alpha$ , reject any null hypothesis with $p$ -value below $\alpha/m$
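A minimal sketch of the Bonferroni rule follows. The function name `bonferroni_reject` is an assumption for illustration, and the five $p$ -values are the fund manager values quoted elsewhere in these notes.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0j whenever p_j < alpha / m."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

fund_p_values = [0.006, 0.918, 0.012, 0.601, 0.756]
print(bonferroni_reject(fund_p_values))  # [True, False, False, False, False]
```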

Holm's Method for Controlling the FWER

Compute $p$ -values, $p_1,\dots,p_m$ for the $m$ null hypotheses $H_{01},\dots,H_{0m}$
Order the $m$ $p$ -values so that $p_{(1)}\leq p_{(2)}\leq\cdots\leq p_{(m)}$
Define $L=\min\left\{j:p_{(j)}>\frac{\alpha}{m+1-j}\right\}$
Reject all null hypotheses $H_{0j}$ for which $p_j<p_{(L)}$
- Holm's method controls the FWER at level $\alpha$

Holm's Method on the Fund Manager Data

The ordered $p$ -values are $p_{(1)}=0.006,\;p_{(2)}=0.012,\;p_{(3)}=0.601,\;p_{(4)}=0.756,\;p_{(5)}=0.918$
The Holm procedure rejects the two null hypotheses with the smallest $p$ -values, because
- $p_{(1)}=0.006<0.05/(5+1-1)=0.0100$
- $p_{(2)}=0.012<0.05/(5+1-2)=0.0125$
- $p_{(3)}=0.601>0.05/(5+1-3)=0.0167$
Holm rejects $\mathcal{H}_0$ for the first and third managers, but Bonferroni only rejects $\mathcal{H}_0$ for the first manager
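Holm's procedure can be sketched as a step-down loop over the sorted $p$ -values, stopping at the first one that exceeds its threshold $\alpha/(m+1-j)$ . This is one possible implementation rather than a reference one, and the name `holm_reject` is hypothetical. The $p$ -values are listed in manager order.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure; returns one rejection flag per hypothesis."""
    m = len(p_values)
    # indices of the hypotheses, from smallest to largest p-value
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        # rank plays the role of j; stop at the first j = L that fails
        if p_values[i] > alpha / (m + 1 - rank):
            break
        reject[i] = True
    return reject

fund_p_values = [0.006, 0.918, 0.012, 0.601, 0.756]
print(holm_reject(fund_p_values))  # [True, False, True, False, False]
```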

Comparison with m=10 p-values

Aim to control FWER at $0.05$
$p$ -values below the black horizontal line are rejected by Bonferroni
$p$ -values below the blue line are rejected by Holm
Holm and Bonferroni make the same conclusion on the black points, but only Holm rejects for the red point

A More Extreme Example

Now five hypotheses are rejected by Holm but not by Bonferroni ...
even though both control FWER at $0.05$

Holm or Bonferroni?

Bonferroni is simple : reject any null hypothesis with a $p$ -value below $\alpha/m$
Holm is slightly more complicated, but it will lead to more rejections while controlling FWER

So, Holm is a better choice

The False Discovery Rate

Recall that $V$ denotes the number of false rejections (Type 1 errors), and let $R$ denote the total number of rejected null hypotheses
The FWER focuses on controlling $\Pr(V\geq1)$ , i.e., the probability of falsely rejecting any null hypothesis
This is a tough ask when $m$ is large. It will cause us to be very conservative (i.e. to very rarely reject)
Instead, we can control the false discovery rate $\text{FDR}=E(V/R)$

Intuition Behind the False Discovery Rate

$\text{FDR}=E(V/R)=E\left(\frac{\text{number of false rejections}}{\text{total number of rejections}}\right)$

A scientist conducts a hypothesis test on each of $m=20,000$ drug candidates
She wants to identify a smaller set of promising candidates to investigate further
She wants reassurance that this smaller set is really promising, i.e. not too many falsely rejected $\mathcal{H}_0$ 's
FWER controls $\Pr(\text{at least one false rejection})$
FDR controls the fraction of candidates in the smaller set that are really false rejections.

Benjamini-Hochberg Procedure to Control FDR

Specify $q$ , the level at which to control the FDR
Compute $p$ -values $p_1,\dots,p_m$ for the null hypotheses $H_{01},\dots,H_{0m}$
Order the $p$ -values so that $p_{(1)}\leq\dots\leq p_{(m)}$
Define $L=\max\left\{j:p_{(j)}<qj/m\right\}$
Reject all null hypotheses $H_{0j}$ for which $p_j\leq p_{(L)}$
Then, FDR $\leq$ $q$
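The steps above can be sketched as follows. This is a minimal illustration, not a reference implementation. The name `benjamini_hochberg_reject` is hypothetical, and the five $p$ -values are the fund manager values quoted earlier, in manager order.

```python
def benjamini_hochberg_reject(p_values, q=0.05):
    """Benjamini-Hochberg procedure; returns one rejection flag per hypothesis."""
    m = len(p_values)
    sorted_p = sorted(p_values)
    # ranks j (1-based) whose ordered p-value satisfies p_(j) < q * j / m
    passing = [j for j, p in enumerate(sorted_p, start=1) if p < q * j / m]
    if not passing:
        return [False] * m  # no rank qualifies, so nothing is rejected
    cutoff = sorted_p[max(passing) - 1]  # p_(L), with L the largest such rank
    return [p <= cutoff for p in p_values]

fund_p_values = [0.006, 0.918, 0.012, 0.601, 0.756]
print(benjamini_hochberg_reject(fund_p_values))  # [True, False, True, False, False]
```

Because $L$ is a maximum, a hypothesis can be rejected even if its own $p$ -value misses its threshold, as long as a larger ordered $p$ -value passes.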

A Comparison of FDR vs FWER

Here, $p$ -values for $m=2,000$ null hypotheses are displayed
To control FWER at level $\alpha=0.1$ with Bonferroni : reject hypotheses below green line
To control FDR at level $q=0.1$ with Benjamini-Hochberg : reject hypotheses shown in blue

Consider $m=5$ p-values from the Fund data :
$p_1=0.006,\;p_2=0.918,\;p_3=0.012,\;p_4=0.601,\;p_5=0.756$
To control FDR at level $q=0.05$ using Benjamini-Hochberg :
- Notice that $p_{(1)}=0.006<1\times0.05/5$ and $p_{(2)}=0.012<2\times0.05/5$ , while $p_{(3)},\;p_{(4)},\;p_{(5)}$ all exceed their thresholds $j\times0.05/5$
- So, we reject $H_{01}$ and $H_{03}$
To control FWER at level $\alpha=0.05$ using Bonferroni :
- We reject any null hypothesis for which the $p$ -value is less than $0.05/5$
- So, we reject only $H_{01}$

Re-Sampling Approaches

So far, we have assumed that we want to test some null hypothesis $\mathcal{H}_0$ with some test statistic $T$ , and that we know the distribution of $T$ under $\mathcal{H}_0$
This allows us to compute the $p$ -value
What if this theoretical null distribution is unknown?

A Re-Sampling Approach for a Two-Sample t-Test

Suppose we want to test $H_0:E(X)=E(Y)$ versus $H_a :E(X)\neq E(Y)$ , using $n_X$ independent observations from $X$ and $n_Y$ independent observations from $Y$
The two-sample $t$ -statistic takes the form $T=\frac{\hat \mu_X-\hat \mu_Y}{s\sqrt{1/n_X+1/n_Y}}$
If $n_X$ and $n_Y$ are large, then $T$ approximately follows a $\mathcal{N}(0,1)$ distribution under $\mathcal{H}_0$
If $n_X$ and $n_Y$ are small, then we don't know the theoretical null distribution of $T$
Let's take a permutation or re-sampling approach...

1. Compute the two-sample $t$ -statistic $T$ on the original data $x_1,\dots,x_{n_X}$ and $y_1,\dots,y_{n_Y}$
2. For $b=1,\dots,B$ (where $B$ is a large number, like $1,000$ ) :
2.1. Randomly shuffle the $n_X+n_Y$ observations
2.2. Call the first $n_X$ shuffled observations $x^*_1,\dots,x^*_{n_X}$ and call the remaining observations $y^*_1,\dots,y^*_{n_Y}$
2.3. Compute a two-sample $t$ -statistic on the shuffled data, and call it $T^{*b}$
3. The $p$ -value is given by $\frac{\sum^B_{b=1} 1_{(|T^{*b}|\geq|T|)}}{B}$
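A minimal sketch of this permutation procedure, using only the Python standard library, might look like this. The function names, the seed, and the data are hypothetical. The sketch assumes a pooled-variance $t$ -statistic and data with enough distinct values that the pooled variance is never zero.

```python
import math
import random
from statistics import mean, variance

def t_statistic(x, y):
    """Pooled two-sample t-statistic."""
    n_x, n_y = len(x), len(y)
    pooled_var = ((n_x - 1) * variance(x) + (n_y - 1) * variance(y)) / (n_x + n_y - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled_var * (1 / n_x + 1 / n_y))

def permutation_p_value(x, y, n_permutations=1000, seed=0):
    """Fraction of shuffles whose |T*| is at least the observed |T|."""
    rng = random.Random(seed)  # local generator, so results are reproducible
    observed = abs(t_statistic(x, y))
    pooled = list(x) + list(y)
    n_x = len(x)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # step 2.1: shuffle all observations
        t_star = t_statistic(pooled[:n_x], pooled[n_x:])  # steps 2.2 and 2.3
        if abs(t_star) >= observed:
            count += 1
    return count / n_permutations

# Hypothetical measurements, invented for this example
x = [5.1, 6.3, 5.8, 7.0, 6.1, 6.6]
y = [3.2, 4.0, 3.7, 4.5, 3.9, 4.3]
p = permutation_p_value(x, y)
print(p)
```

With $B$ shuffles the smallest non-zero $p$ -value this can report is $1/B$ , so $B$ should be large relative to the precision needed.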

Theoretical $p$ -value is $0.041$ . Re-sampling $p$ -value is $0.042$
Theoretical $p$ -value is $0.571$ . Re-sampling $p$ -value is $0.673$
