MTP package:multtest R Documentation
A function to perform resampling-based multiple hypothesis testing
Description:
A user-level function to perform multiple testing procedures
(MTP). A variety of t- and f-tests, including robust versions of
each test, are implemented. Single-step and step-down minP and
maxT methods are used to control the chosen type I error rate
(FWER, gFWER, TPPFP, or FDR). Bootstrap and permutation null
distributions are available. Arguments are provided for user
control of output. Gene selection in microarray experiments is one
application.
Usage:
MTP(X, W = NULL, Y = NULL, Z = NULL, Z.incl = NULL, Z.test = NULL,
    na.rm = TRUE, test = "t.twosamp.unequalvar", robust = FALSE,
    standardize = TRUE, alternative = "two.sided", psi0 = 0,
    typeone = "fwer", k = 0, q = 0.1, fdr.method = "conservative",
    alpha = 0.05, smooth.null = FALSE, nulldist = "boot", B = 1000,
    method = "ss.maxT", get.cr = FALSE, get.cutoff = FALSE,
    get.adjp = TRUE, keep.nulldist = TRUE, seed = NULL)
Arguments:
X: A matrix or data.frame containing the raw data. For currently
implemented tests, one hypothesis is tested for each row of
the data.
W: A vector or matrix containing non-negative weights to be used
in computing the test statistics. If a matrix, 'W' must be
the same dimension as 'X' with one weight for each value in
'X'. If a vector, 'W' may contain one weight for each
observation (i.e. column) of 'X' or one weight for each
variable (i.e. row) of 'X'. In either case, the weights are
duplicated appropriately. Weighted f-tests are not
available. Default is 'NULL'.
Y: A vector, factor, or 'Surv' object containing the outcome of
interest. This may be class labels (f-tests and two sample
t-tests) or a continuous or polytomous dependent variable
(linear regression based t-tests), or survival data (Cox
proportional hazards based t-tests). Default is 'NULL'.
Z: A vector, factor, or matrix containing covariate data to be
used in the regression (linear and Cox) models. Each variable
should be in one column, so that 'nrow(Z)=ncol(X)'. The
arguments 'Z.incl' and 'Z.test' allow one to specify which
covariates to use in a particular test without modifying the
input 'Z'. Default is 'NULL'.
Z.incl: The indices of the columns of 'Z' (i.e. which variables) to
include in the model. These can be numbers or column names
(if the columns are named). Default is 'NULL'.
Z.test: The index or name of the column of 'Z' (i.e. which variable)
to use to test for association with each row of 'X' in a
linear model. Only used for 'test="lm.XvsZ"', where it is
necessary to specify which covariate's regression parameter
is of interest. Default is 'NULL'.
na.rm: Logical indicating whether to remove observations with an NA.
Default is 'TRUE'.
test: Character string specifying the test statistics to use, by
default 't.twosamp.unequalvar'. See details (below) for a
list of tests.
robust: Logical indicating whether to use the robust version of the
chosen test, e.g. Wilcoxon signed rank test for robust
one-sample t-test or 'rlm' instead of 'lm' in linear models.
Default is 'FALSE'.
standardize: Logical indicating whether to use the standardized version
of the test statistics (usual t-statistics are standardized).
Default is 'TRUE'.
alternative: Character string indicating the alternative hypotheses, by
default 'two.sided'. For one-sided tests, use 'less' (null
hypothesis of 'greater than or equal') or 'greater' (null
hypothesis of 'less than or equal').
psi0: The hypothesized null value, typically zero (default).
Currently, this should be a single value, which is used for
all hypotheses.
typeone: Character string indicating which type I error rate to
control, by default family-wise error rate ('fwer'). Other
options include generalized family-wise error rate ('gfwer'),
with parameter 'k' giving the allowed number of false
positives, and tail probability of the proportion of false
positives ('tppfp'), with parameter 'q' giving the allowed
proportion of false positives. The false discovery rate
('fdr') can also be controlled.
k: The allowed number of false positives for gFWER control.
Default is 0 (FWER).
q: The allowed proportion of false positives for TPPFP control.
Default is 0.1.
fdr.method: Character string indicating which FDR controlling method
should be used when 'typeone="fdr"'. The options are
"conservative" (default) for the more conservative, general
FDR controlling procedure and "restricted" for the method
which requires more assumptions.
alpha: The target nominal type I error rate, which may be a vector
of error rates. Default is 0.05.
smooth.null: Indicator of whether to use a kernel density estimate for
the tail of the null distribution for computing raw p-values
close to zero. Only used if 'rawp' would be zero without
smoothing. Default is 'FALSE'.
nulldist: Character string indicating which resampling method to use
for estimating the joint test statistics null distribution,
by default non-parametric bootstrap ('boot').
B: The number of bootstrap iterations (i.e. how many resampled
data sets) or the number of permutations (if 'nulldist' is
'perm'). Can be reduced to increase the speed of computation,
at a cost to precision. Default is 1000.
method: The multiple testing procedure to use. Options are
single-step maxT ('ss.maxT', default), single-step minP
('ss.minP'), step-down maxT ('sd.maxT'), and step-down minP
('sd.minP').
get.cr: Logical indicating whether to compute confidence intervals
for the estimates. Not available for f-tests. Default is
'FALSE'.
get.cutoff: Logical indicating whether to compute thresholds for the
test statistics. Default is 'FALSE'.
get.adjp: Logical indicating whether to compute adjusted p-values.
Default is 'TRUE'.
keep.nulldist: Logical indicating whether to return the computed null
distribution, by default 'TRUE'. Note that this matrix can be
quite large.
seed: Integer to be used as argument to 'set.seed' to set the seed
for the random number generator for bootstrap resampling.
This argument can be used to repeat exactly a test performed
with a given seed. If the seed is specified via this
argument, the same seed will be returned in the seed slot of
the MTP object created. Otherwise a random seed will be generated,
used and returned.
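The vector forms of 'W' described above can be pictured with a small base R sketch. The helper 'expand_weights' is hypothetical (it is not part of multtest) and only illustrates one plausible way a weight vector could be duplicated to the dimensions of 'X'. When 'nrow(X)' equals 'ncol(X)' the two vector forms are ambiguous; this sketch assumes the per-observation reading takes precedence.

```r
# Hypothetical helper (not part of multtest): expand a weight vector to the
# dimensions of X, one weight per observation (column) or per variable (row).
expand_weights <- function(X, W) {
  X <- as.matrix(X)
  if (is.null(W)) return(NULL)
  stopifnot(all(W >= 0))               # weights must be non-negative
  if (is.matrix(W)) {
    stopifnot(all(dim(W) == dim(X)))   # one weight per value of X
    return(W)
  }
  if (length(W) == ncol(X)) {
    # one weight per observation: repeat the vector along every row
    matrix(W, nrow = nrow(X), ncol = ncol(X), byrow = TRUE)
  } else if (length(W) == nrow(X)) {
    # one weight per variable: repeat the vector along every column
    matrix(W, nrow = nrow(X), ncol = ncol(X))
  } else {
    stop("length(W) must equal ncol(X) or nrow(X)")
  }
}

X <- matrix(rnorm(12), nrow = 3)           # 3 variables, 4 observations
W_obs <- expand_weights(X, c(1, 2, 3, 4))  # per-observation weights
W_var <- expand_weights(X, c(10, 20, 30))  # per-variable weights
```

Either way the result has the same dimensions as 'X', which is the matrix form that 'W' also accepts directly.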
Details:
A multiple testing procedure (MTP) is defined by choices of test
statistics, type I error rate, null distribution and method for
error rate control. Each component is described here. See
references for more detail.
Test statistics are determined by the values of 'test':
t.onesamp: one-sample t-statistic for tests of means;
t.twosamp.equalvar: equal variance two-sample t-statistic for
tests of differences in means (two-sample t-statistic);
t.twosamp.unequalvar: unequal variance two-sample t-statistic for
tests of differences in means (two-sample Welch t-statistic);
t.pair: two-sample paired t-statistic for tests of differences in
means;
f: multi-sample f-statistic for tests of equality of population
means (assumes constant variance across groups, but not
normality);
f.block: multi-sample f-statistic for tests of equality of
population means in a block design (assumes constant variance
across groups, but not normality);
lm.XvsZ: t-statistic for tests of regression coefficients for
variable 'Z.test' in linear models, each with a row of X as
outcome, possibly adjusted by covariates 'Z.incl' from the
matrix 'Z' (in the case of no covariates, one recovers the
one-sample t-statistic, 't.onesamp');
lm.YvsXZ: t-statistic for tests of regression coefficients in
linear models, with outcome Y and each row of X as covariate
of interest, with possibly other covariates 'Z.incl' from the
matrix 'Z';
coxph.YvsXZ: t-statistic for tests of regression coefficients in
Cox proportional hazards survival models, with outcome Y and
each row of X as covariate of interest, with possibly other
covariates 'Z.incl' from the matrix 'Z'.
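To make the regression-based statistics in the list above concrete, the following base R sketch computes, for each row of 'X', the t-statistic of one covariate's coefficient in a linear model with that row as outcome. It is only an illustration of the kind of statistic 'lm.XvsZ' describes, not the package's internal code; the helper 'lm_xvsz_t' and the column names 'dose' and 'age' are made up for the example.

```r
set.seed(1)
X <- matrix(rnorm(5 * 20), nrow = 5)           # 5 hypotheses, 20 observations
Z <- cbind(dose = rnorm(20), age = rnorm(20))  # covariates, nrow(Z) == ncol(X)

# Hypothetical helper: t-statistic for the coefficient of column z_test in
# lm(X[j, ] ~ z_test + z_incl), computed separately for each row j of X.
lm_xvsz_t <- function(X, Z, z_test, z_incl = NULL) {
  covars <- as.data.frame(Z[, c(z_test, z_incl), drop = FALSE])
  apply(X, 1, function(x) {
    d <- data.frame(x = x, covars)
    fit <- lm(x ~ ., data = d)                 # outcome: one row of X
    summary(fit)$coefficients[z_test, "t value"]
  })
}

tstat <- lm_xvsz_t(X, Z, z_test = "dose", z_incl = "age")
```

Here 'z_test' plays the role of 'Z.test' (the coefficient being tested) and 'z_incl' the role of the adjustment covariates.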
When 'robust=TRUE', non-parametric versions of each test are
performed. For the linear models, this means 'rlm' is used instead
of 'lm'. There is not currently a robust version of
'test="coxph.YvsXZ"'. For the t- and f-tests, data values are simply
replaced by their ranks. This is equivalent to performing the
following familiar, named rank-based tests. The conversion after
each test is the formula to convert from the MTP test to the
statistic reported by the listed R function (where num is the
numerator of the MTP test statistics, n is total sample size, nk
is group k sample size, K is total number of groups or treatments,
and rk are the ranks in group k).
t.onesamp or t.pair: Wilcoxon signed rank, 'wilcox.test' with
'y=NULL' or 'paired=TRUE',
conversion: num/n
t.twosamp.equalvar: Wilcoxon rank sum or Mann-Whitney,
'wilcox.test',
conversion: n2*(num+mean(r1)) - n2*(n2+1)/2
f: Kruskal-Wallis rank sum, 'kruskal.test',
conversion: num*12/(n*(n-1))
f.block: Friedman rank sum, 'friedman.test',
conversion: num*12/(K*(K+1))
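As a check on the two-sample conversion listed above, the sketch below applies it in base R and compares the result with 'wilcox.test'. It assumes that num is the difference in mean ranks, group 2 minus group 1; that reading is an assumption made for this illustration, chosen because it makes the formula reduce to the usual rank-sum statistic.

```r
set.seed(2)
x1 <- rnorm(6)             # group 1
x2 <- rnorm(8, mean = 1)   # group 2
r  <- rank(c(x1, x2))      # ranks in the pooled sample (continuous data, no ties)
r1 <- r[seq_along(x1)]
r2 <- r[length(x1) + seq_along(x2)]
n2 <- length(x2)

# Assumption: numerator of the rank-based two-sample statistic is the
# difference in mean ranks (group 2 minus group 1).
num <- mean(r2) - mean(r1)

# Conversion from the list above: reduces to sum(r2) - n2*(n2+1)/2
W_from_num <- n2 * (num + mean(r1)) - n2 * (n2 + 1) / 2

# Rank-sum statistic reported by wilcox.test with group 2 as first sample
W_wilcox <- unname(wilcox.test(x2, x1)$statistic)
```

Under that assumption the two quantities agree, since n2*(num+mean(r1)) is simply the sum of the ranks in group 2.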
The implemented MTPs are based on control of the family-wise error
rate, defined as the probability of any false positives. Let Vn
denote the (unobserved) number of false positives. Then, control
of FWER at level alpha means that Pr(Vn>0)<=alpha. The set of
rejected hypotheses under a FWER controlling procedure can be
augmented to increase the number of rejections, while controlling
other error rates. The generalized family-wise error rate is
defined as Pr(Vn>k), so control of gFWER at level alpha means
Pr(Vn>k)<=alpha. It is clear that one can simply take an FWER
controlling procedure, reject k more hypotheses and have control
of gFWER at level alpha. The tail probability of the
proportion of false positives depends on both the number of false
positives (Vn) and the number of rejections (Rn). Control of TPPFP
at level alpha means Pr(Vn/Rn>q)<=alpha, for some proportion q.
Control of the false discovery rate refers to the expected
proportion of false positives (rather than a tail probability).
Control of FDR at level alpha means E(Vn/Rn)<=alpha.
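The four error rates defined above can be written out directly in base R. The sketch below is illustrative only: 'error_rates' and 'augment_gfwer' are hypothetical helpers, not multtest functions. The first estimates each error rate from simulated rejection indicators when the true nulls are known, using the convention Vn/Rn = 0 when there are no rejections. The second mimics the augmentation idea of rejecting the k next most significant hypotheses after an FWER-controlling procedure.

```r
# reject:  logical matrix, hypotheses in rows, simulated data sets in columns
# is_null: logical vector, TRUE where the null hypothesis is actually true
error_rates <- function(reject, is_null, k = 0, q = 0.1) {
  Vn <- colSums(reject & is_null)      # false positives per data set
  Rn <- colSums(reject)                # rejections per data set
  prop <- ifelse(Rn > 0, Vn / Rn, 0)   # convention: 0 when nothing is rejected
  c(FWER  = mean(Vn > 0),
    gFWER = mean(Vn > k),
    TPPFP = mean(prop > q),
    FDR   = mean(prop))
}

# Reject everything an FWER procedure rejects, plus the k hypotheses with the
# next smallest adjusted p-values.
augment_gfwer <- function(adjp, alpha = 0.05, k = 0) {
  reject <- adjp <= alpha
  ord <- order(adjp)
  extra <- head(ord[!reject[ord]], k)
  reject[extra] <- TRUE
  reject
}

reject <- cbind(c(TRUE, TRUE, FALSE, FALSE),
                c(TRUE, FALSE, TRUE, FALSE),
                c(FALSE, FALSE, FALSE, FALSE),
                c(TRUE, TRUE, TRUE, TRUE))
is_null <- c(FALSE, FALSE, TRUE, TRUE)
rates <- error_rates(reject, is_null, k = 1, q = 0.1)

adjp <- c(0.01, 0.20, 0.04, 0.08, 0.60)
aug <- augment_gfwer(adjp, alpha = 0.05, k = 1)
```

In the toy rejection matrix, two of the four data sets contain a false positive, but only one contains more than k = 1 of them, which is why the gFWER estimate is smaller than the FWER estimate.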
In practice, one must choose a method for estimating the test
statistics null distribution. We have implemented an ordinary
non-parametric bootstrap estimator and a permutation estimator
(which makes sense in certain settings, see references). The
non-parametric bootstrap estimator (default) provides asymptotic
control of the type I error rate for any data generating
distribution, whereas the permutation estimator requires the
subset pivotality assumption. One drawback of both methods is the
discreteness of the estimated null distribution when the sample
size is small. Furthermore, when the sample size is small enough,
it is possible that ties will lead to a very small variance
estimate. Using 'standardize=FALSE' allows one to avoid these
unusually small test statistic denominators. Parametric bootstrap
estimators are another option (not yet implemented).
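A rough base R sketch of the non-parametric bootstrap estimator described above, for one-sample t-statistics, may help fix ideas. It is not the package's implementation. It assumes the null-value shift and scale described in the references for t-statistics: each hypothesis's bootstrap statistics are centered at 0 and rescaled so that their variance is at most 1. The helper 'one_sample_t' is made up for the example.

```r
set.seed(3)
X <- matrix(rnorm(4 * 15, mean = 0.5), nrow = 4)  # 4 hypotheses, 15 observations
B <- 200                                          # number of bootstrap samples

# One-sample t-statistic for each row (null value 0)
one_sample_t <- function(X) {
  rowMeans(X) / (apply(X, 1, sd) / sqrt(ncol(X)))
}

# Raw bootstrap statistics: resample observations (columns) with replacement.
# Result has one row per hypothesis and one column per bootstrap sample.
Tboot <- replicate(B, {
  idx <- sample(ncol(X), replace = TRUE)
  one_sample_t(X[, idx, drop = FALSE])
})

# Assumed null-value shift and scale: mean 0, variance at most 1 per hypothesis
center   <- rowMeans(Tboot)
shrink   <- sqrt(pmin(1, 1 / apply(Tboot, 1, var)))
nulldist <- shrink * (Tboot - center)
```

The resulting matrix has the layout described for the 'nulldist' slot: hypotheses in rows and bootstrap iterations in columns.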
Given observed test statistics, a type I error rate (with nominal
level), and a test statistics null distribution, MTPs provide
adjusted p-values, cutoffs for test statistics, and possibly
confidence regions for estimates. Four methods are implemented,
based on minima of p-values and maxima of test statistics. Only
the step-down methods are currently available with the permutation
null distribution.
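The maxT idea can be sketched in a few lines of base R. Given observed statistics and a null distribution matrix (hypotheses in rows, resampled statistics in columns), the single-step version compares each observed statistic with the maximum over all hypotheses, while the step-down version uses successive maxima over the hypotheses not yet rejected. The helpers 'ss_maxT' and 'sd_maxT' below are hypothetical, two-sided illustrations and are not the functions exported by multtest.

```r
# Single-step maxT: adjusted p-value = Pr(max_j |Z_j| >= |t|)
ss_maxT <- function(stat, nulldist) {
  max_null <- apply(abs(nulldist), 2, max)   # max over hypotheses, per resample
  vapply(abs(stat), function(t) mean(max_null >= t), numeric(1))
}

# Step-down maxT: successive maxima over the remaining hypotheses
sd_maxT <- function(stat, nulldist) {
  m <- length(stat)
  ord <- order(abs(stat), decreasing = TRUE) # most significant first
  absnull <- abs(nulldist[ord, , drop = FALSE])
  # for each resample, max over hypotheses ranked at or below each position
  succ_max <- apply(absnull, 2, function(z) rev(cummax(rev(z))))
  if (m == 1) succ_max <- matrix(succ_max, nrow = 1)
  p <- rowMeans(succ_max >= abs(stat)[ord])
  adjp <- numeric(m)
  adjp[ord] <- cummax(p)                     # enforce monotone adjusted p-values
  adjp
}

stat <- c(3, 1, 2)
nulldist <- rbind(c( 0.5, -3.5,  1.0, 0.2),
                  c( 1.5,  0.1, -0.4, 0.3),
                  c(-2.5,  0.2,  0.6, 0.1))
adjp_ss <- ss_maxT(stat, nulldist)
adjp_sd <- sd_maxT(stat, nulldist)
```

In this sketch the step-down adjusted p-values are never larger than the single-step ones, because each successive maximum is taken over a subset of the hypotheses.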
Value:
An object of class 'MTP', with the following slots:
'statistic': Object of class 'numeric', observed test statistics for
each hypothesis, specified by the values of the 'MTP'
arguments 'test', 'robust', 'standardize', and 'psi0'.
'estimate': For the test of single-parameter null hypotheses using
t-statistics (i.e., not the F-tests), the numeric vector of
estimated parameters corresponding to each hypothesis, e.g.
means, differences in means, regression parameters.
'sampsize': Object of class 'numeric', number of columns (i.e.
observations) in the input data set.
'rawp': Object of class 'numeric', unadjusted, marginal p-values for
each hypothesis.
'adjp': Object of class 'numeric', adjusted (for multiple testing)
p-values for each hypothesis (computed only if the 'get.adjp'
argument is TRUE).
'conf.reg': For the test of single-parameter null hypotheses using
t-statistics (i.e., not the F-tests), the numeric array of
lower and upper simultaneous confidence limits for the
parameter vector, for each value of the nominal Type I error
rate 'alpha' (computed only if the 'get.cr' argument is
TRUE).
'cutoff': The numeric matrix of cut-offs for the vector of test
statistics for each value of the nominal Type I error rate
'alpha' (computed only if the 'get.cutoff' argument is TRUE).
'reject': Object of class '"matrix"', rejection indicators (TRUE for a
rejected null hypothesis), for each value of the nominal Type
I error rate 'alpha'.
'nulldist': The numeric matrix for the estimated test statistics null
distribution (returned only if 'keep.nulldist=TRUE'; option
not currently available for permutation null distribution,
i.e., 'nulldist="perm"'). By default (i.e., for
'nulldist="boot"'), the entries of 'nulldist' are the null
value shifted and scaled bootstrap test statistics, with one
null test statistic value for each hypothesis (rows) and
bootstrap iteration (columns).
'call': Object of class 'call', the call to the MTP function.
'seed': An integer for specifying the state of the random number
generator used to create the resampled datasets. The seed can
be reused for reproducibility in a repeat call to 'MTP'. This
argument is currently used only for the bootstrap null
distribution (i.e., for 'nulldist="boot"'). See '? set.seed'
for details.
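The layout of the 'reject' slot (one row per hypothesis, one column per value of 'alpha') can be pictured with plain R objects. The sketch below assumes the usual rule that a hypothesis is rejected when its adjusted p-value is at most alpha; the p-values and names are invented for illustration and do not come from an actual 'MTP' call.

```r
adjp  <- c(g1 = 0.003, g2 = 0.040, g3 = 0.070, g4 = 0.500)  # made-up adjusted p-values
alpha <- c(0.01, 0.05, 0.10)                                # vector of nominal levels

# One rejection indicator per hypothesis (row) and per level (column)
reject <- outer(adjp, alpha, "<=")
dimnames(reject) <- list(names(adjp), paste0("alpha=", alpha))

n_rejected <- colSums(reject)   # number of rejections at each level
```

Larger values of alpha can only add rejections, so the column sums are non-decreasing from left to right.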
Author(s):
Katherine S. Pollard,
with design contributions from Sandrine Dudoit and Mark J. van
der Laan.
References:
M.J. van der Laan, S. Dudoit, K.S. Pollard (2004), Augmentation
Procedures for Control of the Generalized Family-Wise Error Rate
and Tail Probabilities for the Proportion of False Positives,
Statistical Applications in Genetics and Molecular Biology, 3(1).
M.J. van der Laan, S. Dudoit, K.S. Pollard (2004), Multiple
Testing. Part II. Step-Down Procedures for Control of the
Family-Wise Error Rate, Statistical Applications in Genetics and
Molecular Biology, 3(1).
S. Dudoit, M.J. van der Laan, K.S. Pollard (2004), Multiple
Testing. Part I. Single-Step Procedures for Control of General
Type I Error Rates, Statistical Applications in Genetics and
Molecular Biology, 3(1).
Katherine S. Pollard and Mark J. van der Laan, "Resampling-based
Multiple Testing: Asymptotic Control of Type I Error and
Applications to Gene Expression Data" (June 24, 2003). U.C.
Berkeley Division of Biostatistics Working Paper Series. Working
Paper 121.
Thank you to Peter Dimitrov for suggestions about the code.
See Also:
'MTP-class', 'MTP-methods', 'mt.minP', 'mt.maxT', 'ss.maxT',
'fwer2gfwer'
Examples:
# data: 9 hypotheses (rows), 10 observations (columns)
set.seed(99)
data <- matrix(rnorm(90), nrow = 9)
group <- c(rep(1, 5), rep(0, 5))
# FWER control with bootstrap null distribution (B=100 for speed)
m1 <- MTP(X = data, Y = group, alternative = "less", B = 100, method = "sd.minP")
print(m1)
summary(m1)
par(mfrow = c(2, 2))
plot(m1, top = 9)