[ML&DL] 7. Multiple Hypothesis Testing

KBC · December 14, 2024

Multiple Hypothesis Testing

  • This session focuses on multiple hypothesis testing
  • A single null hypothesis might look like

    \mathcal{H}_0 : the expected blood pressures of mice in the control and treatment groups are the same

  • We will now consider testing m null hypotheses, H_{01},\dots,H_{0m},
    where e.g.

    \mathcal{H}_{0j} : the expected values of the j^{th} biomarker among mice in the control and treatment groups are equal

  • In this setting, we need to be careful to avoid incorrectly rejecting too many null hypotheses,
    i.e. having too many false positives

A Quick Review of Hypothesis Testing

  • Hypothesis tests allow us to answer simple yes-or-no questions, such as
    • Is the true coefficient \beta_j in a linear regression equal to zero?
    • Does the expected blood pressure among mice in the treatment group equal the expected blood pressure among mice in the control group?
  • Hypothesis testing proceeds as follows :
    1. Define the null and alternative hypotheses
    2. Construct the test statistic
    3. Compute the p-value
    4. Decide whether to reject the null hypothesis

1. Define the Null and Alternative Hypotheses

  • We divide the world into null and alternative hypotheses
  • The null hypothesis, \mathcal{H}_0, is the default state of belief about the world. For instance :
    1. The true coefficient \beta_j equals zero
    2. There is no difference in the expected blood pressure
  • The alternative hypothesis, \mathcal{H}_a, represents something different and unexpected. For instance :
    1. The true coefficient \beta_j is non-zero
    2. There is a difference in the expected blood pressure

2. Construct the Test Statistic

  • The test statistic summarizes the extent to which our data are consistent with \mathcal{H}_0
  • Let \hat\mu_t and \hat\mu_c respectively denote the average blood pressure for the n_t and n_c mice in the treatment and control groups
  • To test \mathcal{H}_0 : \mu_t=\mu_c, we use a two-sample t-statistic
    T=\frac{\hat\mu_t-\hat\mu_c}{s\sqrt{\frac{1}{n_t}+\frac{1}{n_c}}},\qquad s^2=\frac{(n_t-1)s_t^2+(n_c-1)s_c^2}{n_t+n_c-2}
    where s^2 is the pooled estimator of the common variance
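As a quick sanity check, the statistic above is easy to compute directly. A minimal sketch, using made-up blood-pressure readings (the numbers are hypothetical, not from any real study):

```python
import math

def pooled_t_statistic(x, y):
    """Two-sample t-statistic with a pooled standard deviation s."""
    nt, nc = len(x), len(y)
    mu_t = sum(x) / nt
    mu_c = sum(y) / nc
    # Unbiased sample variance within each group
    s2_t = sum((v - mu_t) ** 2 for v in x) / (nt - 1)
    s2_c = sum((v - mu_c) ** 2 for v in y) / (nc - 1)
    # Pooled variance: s^2 = ((nt-1)s_t^2 + (nc-1)s_c^2) / (nt + nc - 2)
    s = math.sqrt(((nt - 1) * s2_t + (nc - 1) * s2_c) / (nt + nc - 2))
    return (mu_t - mu_c) / (s * math.sqrt(1 / nt + 1 / nc))

# Hypothetical blood-pressure readings for treatment and control mice
treatment = [118, 122, 120, 125, 119]
control = [128, 131, 126, 130, 129]
T = pooled_t_statistic(treatment, control)  # strongly negative here
```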

3. Compute the p-value

  • The p-value is the probability of observing a test statistic at least as extreme as the observed statistic, under the assumption that \mathcal{H}_0 is true
  • A small p-value provides evidence against \mathcal{H}_0
  • Suppose we compute T=2.33 for our test of \mathcal{H}_0:\mu_t=\mu_c
  • Under \mathcal{H}_0, the two-sample t-statistic approximately follows a \mathcal{N}(0,1) distribution
  • The p-value is 0.02 because, if \mathcal{H}_0 is true, we would only see |T| this large 2\% of the time
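The 2% figure can be reproduced from the standard normal tail area; a minimal sketch using only the standard library:

```python
import math

def two_sided_p_value(t_stat):
    """Two-sided p-value under a N(0,1) null: Pr(|T| >= |t_stat|).
    erfc gives the normal survival function after rescaling by sqrt(2)."""
    return math.erfc(abs(t_stat) / math.sqrt(2))

p = two_sided_p_value(2.33)  # about 0.02, matching the example above
```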

4. Decide Whether to Reject Null Hypothesis, Part 1

  • A small p-value indicates that such a large value of the test statistic is unlikely to occur under \mathcal{H}_0
  • So, a small p-value provides evidence against \mathcal{H}_0
  • If the p-value is sufficiently small, then we will want to reject \mathcal{H}_0
  • But how small is small enough? To answer this, we need to understand the Type 1 Error

4. Decide Whether to Reject Null Hypothesis, Part 2

  • A Type 1 Error occurs when we reject \mathcal{H}_0 even though it is true; the Type 1 Error rate is the probability of making a Type 1 Error
  • We want to ensure a small Type 1 Error rate
  • If we reject \mathcal{H}_0 when the p-value is less than \alpha, then the Type 1 Error rate will be at most \alpha
  • So, we reject \mathcal{H}_0 when the p-value falls below some \alpha : often we choose \alpha to equal 0.05 or 0.01

Multiple Testing

  • Now suppose that we wish to test m null hypotheses, H_{01},\dots,H_{0m}
  • Can we simply reject all null hypotheses for which the corresponding p-value falls below 0.01?
  • If we reject all null hypotheses for which the p-value falls below 0.01, then how many Type 1 Errors will we make?

A Thought Experiment

  • Suppose that we flip a fair coin ten times, and we wish to test

    \mathcal{H}_0:\text{the coin is fair}

    • We'll probably get approximately the same number of heads and tails
    • The p-value probably won't be small. We do not reject \mathcal{H}_0
  • But what if we flip 1,024 fair coins ten times each?
    • We'd expect one coin (on average) to come up all tails
    • The p-value for the null hypothesis that this particular coin is fair is less than 0.002!
    • So we would conclude it is not fair, i.e. we reject the null hypothesis, even though it's a fair coin
  • If we test a lot of hypotheses, we are almost certain to get one very small p-value by chance

The Challenge of Multiple Testing

  • Suppose we test H_{01},\dots,H_{0m}, all of which are true, and reject any null hypothesis with a p-value below 0.01
  • Then we expect to falsely reject approximately 0.01\times m null hypotheses
  • If m=10,000, then we expect to falsely reject 100 null hypotheses by chance!

    That's a lot of Type 1 Errors, i.e. false positives
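This is easy to simulate: under a true null hypothesis the p-value is uniform on [0, 1], so drawing m uniform "p-values" and thresholding at 0.01 yields roughly 0.01 × m false rejections. A hedged sketch:

```python
import random

random.seed(0)
m, alpha = 10_000, 0.01

# Under a true null, the p-value is uniform on [0, 1],
# so each test is falsely rejected with probability alpha.
p_values = [random.random() for _ in range(m)]
false_rejections = sum(p < alpha for p in p_values)
# Expected count is alpha * m = 100; the realized count fluctuates around it
```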

The Family-Wise Error Rate

  • The family-wise error rate (FWER) is the probability of making at least one Type 1 Error when conducting m hypothesis tests
  • \text{FWER}=\Pr(V\geq1), where V is the number of null hypotheses that we falsely reject

Challenges in Controlling the FWER

\text{FWER}=1-\Pr(\text{do not falsely reject any null hypotheses})
  • If we test each hypothesis at level \alpha, the tests are independent, and all H_{0j} are true, then
    \text{FWER} = 1-\prod^m_{j=1}(1-\alpha)=1-(1-\alpha)^m
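To see how fast the FWER grows with m, the formula can be evaluated directly:

```python
def fwer(alpha, m):
    """FWER for m independent tests of true nulls, each at level alpha."""
    return 1 - (1 - alpha) ** m

# Even at alpha = 0.05, the FWER approaches 1 quickly as m grows
rates = {m: fwer(0.05, m) for m in (1, 10, 100)}
# rates[1] = 0.05, rates[10] ≈ 0.40, rates[100] ≈ 0.99
```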

The Bonferroni Correction

\text{FWER} = \Pr(\text{falsely reject at least one null hypothesis})\leq\sum^m_{j=1}\Pr(A_j)
  • where A_j is the event that we falsely reject the jth null hypothesis
  • If we only reject hypotheses when the p-value is less than \alpha/m, then
    \text{FWER}\leq\sum^m_{j=1}\Pr(A_j)\leq\sum^m_{j=1}\frac{\alpha}{m}=m\times\frac{\alpha}{m}=\alpha
  • This is the Bonferroni Correction : to control the FWER at level \alpha, reject any null hypothesis with p-value below \alpha/m
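The rule is a one-liner in code. A minimal sketch on three hypothetical p-values (the numbers are made up for illustration):

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0j whenever p_j < alpha / m; this controls the FWER at alpha."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

# Hypothetical p-values; with m = 3 the cutoff is 0.05/3 ≈ 0.0167,
# so only the first hypothesis is rejected
rejections = bonferroni_reject([0.003, 0.04, 0.3])
```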

Holm's Method for Controlling the FWER

  1. Compute p-values, p_1,\dots,p_m, for the m null hypotheses H_{01},\dots,H_{0m}
  2. Order the m p-values so that p_{(1)}\leq p_{(2)}\leq\cdots\leq p_{(m)}
  3. Define
    L=\min\left\{j:p_{(j)}>\frac{\alpha}{m+1-j}\right\}
  4. Reject all null hypotheses H_{0j} for which p_j<p_{(L)}
    • Holm's method controls the FWER at level \alpha

Holm's Method on the Fund Manager Data

  • The ordered p-values are p_{(1)}=0.006,\;p_{(2)}=0.012,\;p_{(3)}=0.601,\;p_{(4)}=0.756,\;p_{(5)}=0.918
  • The Holm procedure rejects the first two null hypotheses, because
    p_{(1)}=0.006<0.05/(5+1-1)=0.0100
    p_{(2)}=0.012<0.05/(5+1-2)=0.0125
    p_{(3)}=0.601>0.05/(5+1-3)=0.0167
  • Holm rejects \mathcal{H}_0 for the first and third managers, but Bonferroni only rejects \mathcal{H}_0 for the first manager
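The calculation above can be checked in code. A minimal sketch of Holm's step-down rule, applied to the five fund manager p-values:

```python
def holm_reject(p_values, alpha=0.05):
    """Holm's step-down method: walk through the ordered p-values,
    stop at the first p_(j) exceeding alpha/(m+1-j), and reject every
    hypothesis whose p-value passed before that point."""
    m = len(p_values)
    order = sorted(range(m), key=lambda j: p_values[j])
    threshold = 0.0
    for rank, j in enumerate(order, start=1):
        if p_values[j] > alpha / (m + 1 - rank):
            break  # first ordered p-value that fails; stop here
        threshold = p_values[j]  # largest p-value that still passes
    return [p <= threshold for p in p_values]

# Fund manager p-values in original order: Holm rejects managers 1 and 3
rejections = holm_reject([0.006, 0.918, 0.012, 0.601, 0.756])
```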

Comparison with m=10 p-values

  • Aim to control the FWER at 0.05
  • p-values below the black horizontal line are rejected by Bonferroni
  • p-values below the blue line are rejected by Holm
  • Holm and Bonferroni make the same conclusion on the black points, but only Holm rejects for the red point

A More Extreme Example

  • Now five hypotheses are rejected by Holm but not by Bonferroni ...
  • even though both control the FWER at 0.05

Holm or Bonferroni?

  • Bonferroni is simple : reject any null hypothesis with a p-value below \alpha/m
  • Holm is slightly more complicated, but it will lead to more rejections while controlling the FWER

    So, Holm is a better choice

The False Discovery Rate

  • Recall that V denotes the number of false rejections and R the total number of rejections
  • The FWER focuses on controlling \Pr(V\geq1), i.e., the probability of falsely rejecting any null hypothesis
  • This is a tough ask when m is large. It will cause us to be super conservative (i.e. to very rarely reject)
  • Instead, we can control the false discovery rate
    \text{FDR}=E(V/R)

Intuition Behind the False Discovery Rate

\text{FDR}=E(V/R)=E\left(\frac{\text{number of false rejections}}{\text{total number of rejections}}\right)
  • A scientist conducts a hypothesis test on each of m=20,000 drug candidates
  • She wants to identify a smaller set of promising candidates to investigate further
  • She wants reassurance that this smaller set is really promising, i.e. not too many falsely rejected \mathcal{H}_0's
  • FWER controls \Pr(\text{at least one false rejection})
  • FDR controls the fraction of candidates in the smaller set that are really false rejections

Benjamini-Hochberg Procedure to Control FDR

  1. Specify q, the level at which to control the FDR
  2. Compute p-values p_1,\dots,p_m for the null hypotheses H_{01},\dots,H_{0m}
  3. Order the p-values so that p_{(1)}\leq\dots\leq p_{(m)}
  4. Define L=\max\left\{j:p_{(j)}<qj/m\right\}
  5. Reject all null hypotheses H_{0j} for which p_j\leq p_{(L)}
    Then, FDR \leq q
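Steps 1-5 above can be sketched directly. Applied to the five fund manager p-values from earlier at q = 0.05, the procedure rejects the same two hypotheses as Holm did:

```python
def bh_reject(p_values, q=0.05):
    """Benjamini-Hochberg: find L = max{j : p_(j) < q*j/m},
    then reject every hypothesis with p_j <= p_(L)."""
    m = len(p_values)
    threshold = 0.0
    for j, p in enumerate(sorted(p_values), start=1):
        if p < q * j / m:
            threshold = p  # largest ordered p-value satisfying the bound
    return [p <= threshold for p in p_values]

# Fund manager p-values: L = 2, so H01 and H03 are rejected
rejections = bh_reject([0.006, 0.918, 0.012, 0.601, 0.756])
```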

A Comparison of FDR vs FWER

  • Here, p-values for m=2,000 null hypotheses are displayed
  • To control the FWER at level \alpha=0.1 with Bonferroni : reject hypotheses below the green line
  • To control the FDR at level q=0.1 with Benjamini-Hochberg : reject hypotheses shown in blue

  • Consider m=5 p-values from the Fund data :
    p_1=0.006,\;p_2=0.918,\;p_3=0.012,\;p_4=0.601,\;p_5=0.756
  • To control the FDR at level q=0.05 using Benjamini-Hochberg :
    • Notice that p_{(1)}<0.05/5 and p_{(2)}<2\times0.05/5, while p_{(3)},\;p_{(4)},\;p_{(5)} all exceed their thresholds, so L=2
    • So, we reject H_{01} and H_{03}
  • To control the FWER at level \alpha=0.05 using Bonferroni :
    • We reject any null hypothesis for which the p-value is less than 0.05/5
    • So, we reject only H_{01}

Re-Sampling Approaches

  • So far, we have assumed that we want to test some null hypothesis \mathcal{H}_0 with some test statistic T, and that we know the distribution of T under \mathcal{H}_0
  • This allows us to compute the p-value
  • What if this theoretical null distribution is unknown?

A Re-Sampling Approach for a Two-Sample t-Test

  • Suppose we want to test H_0:E(X)=E(Y) versus H_a:E(X)\neq E(Y), using n_X independent observations from X and n_Y independent observations from Y
  • The two-sample t-statistic takes the form
    T=\frac{\hat\mu_X-\hat\mu_Y}{s\sqrt{1/n_X+1/n_Y}}
  • If n_X and n_Y are large, then T approximately follows a \mathcal{N}(0,1) distribution under \mathcal{H}_0
  • If n_X and n_Y are small, then we don't know the theoretical null distribution of T
  • Let's take a permutation or re-sampling approach...
  1. Compute the two-sample t-statistic T on the original data x_1,\dots,x_{n_X} and y_1,\dots,y_{n_Y}
  2. For b=1,\dots,B (where B is a large number, like 1,000) :
    2.1. Randomly shuffle the n_X+n_Y observations
    2.2. Call the first n_X shuffled observations x^*_1,\dots,x^*_{n_X} and call the remaining observations y^*_1,\dots,y^*_{n_Y}
    2.3. Compute a two-sample t-statistic on the shuffled data, and call it T^{*b}
  3. The p-value is given by
    \frac{\sum^B_{b=1} 1_{\{|T^{*b}|\geq|T|\}}}{B}
  • In one example, the theoretical p-value is 0.041 and the re-sampling p-value is 0.042; in another, the theoretical p-value is 0.571 and the re-sampling p-value is 0.673
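The three steps above can be sketched in a few lines. The samples below are made up; with a fixed seed the result is reproducible:

```python
import math
import random

def t_stat(x, y):
    """Two-sample t-statistic with a pooled standard deviation."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    s2 = (sum((v - mx) ** 2 for v in x)
          + sum((v - my) ** 2 for v in y)) / (nx + ny - 2)
    return (mx - my) / math.sqrt(s2 * (1 / nx + 1 / ny))

def permutation_p_value(x, y, B=1000, seed=0):
    """Shuffle the pooled observations B times and count how often the
    shuffled |T*| is at least as large as the observed |T| (steps 1-3)."""
    rng = random.Random(seed)
    nx = len(x)
    T = t_stat(x, y)
    pooled = list(x) + list(y)
    count = 0
    for _ in range(B):
        rng.shuffle(pooled)
        if abs(t_stat(pooled[:nx], pooled[nx:])) >= abs(T):
            count += 1
    return count / B

# Hypothetical small samples with clearly different means,
# so the permutation p-value should come out small
p = permutation_p_value([1.1, 2.0, 1.5, 1.8], [3.2, 2.9, 3.5, 3.1])
```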

More on Re-Sampling Approaches

  • Re-sampling approaches are useful if the theoretical null distribution is unavailable, or requires stringent assumptions
  • An extension of the re-sampling approach to compute a pp-value can be used to control FDR
  • This example involved a two-sample tt-test, but similar approaches can be developed for other test statistics

All contents are written based on the GIST Machine Learning & Deep Learning lessons (Instructor : Prof. Sun-dong Kim)
