The hdme
package contains algorithms for regression
problems with measurement error when the number of covariates p is on the same order as the number
of samples n, or even
larger.
Consider a linear regression model, y = Xβ + ϵ, where y is an n-dimensional response vector, X is an n × p design matrix, and ϵ is normally distributed error. With n > p (and X positive definite), unbiased regression coefficients are obtained as
β̂ = (XTX)−1XTy.
In many cases, however, the true covariates X are not directly observed. Instead, we have noisy measurements
W = X + U,
where W is the n × p measurement matrix, and U is the n × p matrix of measurement errors. Assume for the moment that the n rows of U, ui are identically and independently distributed with covariance matrix Σuu, i.e.,
ui ∼ N(0, Σuu), i = 1, …, n.
Since we do not have measurements of X, using a classical linear regression model now yields coefficient estimates
β̂naive = (WTW)−1WTy, which is referred to as the naive estimate in the measurement error literature. Naive estimates are typically biased. For example, when all measurement errors are uncorrelated and each have variance σu, i.e., Σuu = σuIp × p, the expected value of β̂ is
$$ E(\hat{\beta}_{naive}) = \frac{\sigma_{x}^{2}}{\sigma_{u}^{2} + \sigma_{x}^2} \beta = \gamma \beta, $$ where γ is the attenuation factor. Unbiased estimates for the case of linear regression can be obtained by minimizing the corrected loss function Lcorr(β) = ∥y − Wβ∥2 − βTΣuuβ. This Lcorr(β) is not always convex. If the Hessian WTW − Σuu is positive definite, Lcorr(β) is convex, and the estimates can be found using
β̂corr = (WTW − Σuu)−1WTy. Otherwise, iteration methods must be used to find a minimum of L(β).
An estimate Σ̂uu of Σuu can be typically obtained with replicate measurements.
Similar results hold for generalized linear models, e.g., logistic and Poisson regression: Measurement error typically leads to bias in the naive estimates, and correction methods can be used to obtained unbiased estimates. We refer to the book Carroll et al. (2006) for a thorough introduction to measurement error modeling.
Now consider the same setup as in the previous section, but with high-dimensional data. That is, the number of covariates p is either on the same order as n, or even larger than n. In this case, even in the absence of measurement error, regularization is typically needed (see, e.g., Chapter 3 of Hastie, Tibshirani, and Friedman (2009)).
The lasso (Tibshirani (1996)) performs
L1-regularization, which
induces sparsity. A popular R
implementation can be found
in the glmnet
package (Simon et al.
(2011)). For linear regression models, the lasso finds β̂ minimizing the loss L(β) = ∥y − Xβ∥2 + λ∥β∥1,
where λ is a regularization
parameter. The lasso can also be formulated as a constrained
optimization problem, minimize
∥y − Xβ∥2 subject to
∥β∥1 ≤ R, where R is in a one-to-one relationship
with λ.
A lot of effort has been put into understanding the statistical properties of the lasso, both in the classical p < n case and in the high-dimensional p > n case, summarized in Bühlmann and Geer (2011). The results are typically either probabilistic upper bounds on or asymptotic limits for
The impact of measurement error in the lasso for linear models has been studied by Sorensen, Frigessi, and Thoresen (2015). The authors show that estimation of β̂ in the asymptotic limit suffers the same bias as for a multivariate linear model described above. Consistent covariate selection, on the other hand, requires stricter conditions than in the case without measurement error.
Using the glmnet
package, we can illustrate the impact
of measurement error in a small experiment.
First we set up the parameters and generate random data. In order to
repeat the procedure, we create the function
create_example_data
for doing this. For illustration
purposes, we set the true coefficient vector to β = (−2, −1, 0.5, 1, 2, 0, …, 0)T.
create_example_data <- function(n, p, s = 5, sdX = 1, sdU = 0.5,
sdEpsilon = 0.1, family = "gaussian") {
# Independent true covariates with mean zero and standard deviation sdX
X <- matrix(rnorm(n * p, sd = sdX), nrow = n, ncol = p)
# if Gaussian, response has standard deviation sdEpsilon, and zero intercept
# if binomial, response is binomial with mean (1 + exp(-X %*% beta))^(-1)
beta <- c(-2, -1, 0.5, 1, 2, rep(0, p - s))
if(family == "gaussian") {
# True coefficient vector has s non-zero elements and p-s zero elements
y <- X %*% beta + rnorm(n, sd = sdEpsilon)
} else if (family == "binomial") {
# Need an amplification in the binomial case
beta <- beta * 3
y <- rbinom(n, size = 1, prob = (1 + exp(-X %*% beta))**(-1))
}
# The measurements W have mean X and standard deviation sdU.
# We assume uncorrelated measurement errors
W <- X + matrix(rnorm(n * p, sd = sdU), nrow = n, ncol = p)
return(list(X = X, W = W, y = y, beta = beta, sigmaUU = diag(p) * sdU))
}
We then call the function to get the data and put the result in the
list ll
.
Next we run the lasso with cross-validation on the true covariates
and on the noisy measurements. We pick the coefficient estimate at λ1se,
i.e., on standard error on the sparse side of the cross-validation
minimum. This is the default of the coef
function for
objects of class cv.glmnet
.
library(glmnet)
library(dplyr)
# Lasso with cross-validation on data without measurement error
fit1 <- cv.glmnet(ll$X, ll$y)
# Lasso with cross-validation on data with measurement error
fit2 <- cv.glmnet(ll$W, ll$y)
# Create a data frame with results ([-1] because we drop the intercept)
lassoEstimates <- tibble(
index = rep(1:p, times = 3),
beta = c(ll$beta, as.numeric(coef(fit1)[-1]), coef(fit2)[-1]),
label = c(rep("True values", p), rep("No measurement error", p), rep("Measurement error", p))
)
By plotting all the estimated regression coefficients, we see that when the data are subject to measurement error, the number of false positives may increase. Note that this is not necessarily the case for all choices of parameters.
library(ggplot2)
theme_set(theme_bw())
theme_update(
panel.grid = element_blank()
)
ggplot(lassoEstimates, aes(x = index, y = beta, color = label)) +
geom_point() +
xlab("p") +
scale_color_brewer(type = "qualitative", palette = "Paired") +
theme(legend.title=element_blank()) +
ggtitle("Measurement error leading to false positives")
We can also focus on the 5 parameters which are truly nonzero. In this case, we see that in the absence of measurement error, the lasso estimates values quite close to truth. With measurement error, on the other hand, the attenuation is quite clear: the estimates are biased toward zero.
library(tidyr)
estimatesOfNonzero <- lassoEstimates %>%
spread(key = label, value = beta) %>%
filter(`True values` != 0) %>%
gather(key = label, value = beta, -index)
ggplot(estimatesOfNonzero, aes(x = index, y = beta, color = label)) +
geom_point() +
xlab("p") +
scale_color_brewer(type = "qualitative", palette = "Paired") +
theme(legend.title=element_blank()) +
ggtitle("Measurement error leading to attenuation")
The Dantzig selector Candes and Tao (2007) is closely related to the lasso. It is defined as the solution to the optimization problem minimize ∥β∥1, subject to (1/n)∥XT(y − Xβ)∥∞ ≤ λ, where ∥ ⋅ ∥∞ denotes the maximum component norm. A generalized Dantzig selector for generalized linear models was introduced by James and Radchenko (2009). It is defined as the solution to the optimization problem minimize ∥β∥1, subject to (1/n)∥XT(y − μ(Xβ))∥∞ ≤ λ, where μ(⋅) ∈ ℝn is the vector valued mean function of the generalized linear model. Examples include logistic regression with μ(x) = (1 + exp (−x))−1 and Poisson regression with μ(x) = exp (x). Using an iterative reweighing algorithm, the generalized Dantzig selector can be solved as a sequence of linear optimization problems of the same form as the Dantzig selector for linear models.
To our knowledge, no R package to date contains an implementation of
the Generalized Dantzig Selector (GDS). Since the Generalized Matrix
Uncertainty Selector (GMUS) presented later in this vignette is a
generalization of the GDS, we have included a function for computing the
GDS, called gds
.
# Number of samples
n <- 1000
# Number of covariates
p <- 50
# Create example data
ll <- create_example_data(n, p, family = "binomial")
Arguments to gds
. Only X and y are required.
Fit the generalized Dantzig selector on the data.
The result is a list of class gds
.
The list contains the intercept (which has not been penalized), the
regression coefficients beta
, the family, the value of
lambda
and the number of nonzero components of
beta
. By default, gds
uses the value λmin
corresponding to the minimum cross-validated loss for the lasso, using
cv.glmnet
.
str(gds_estimate)
#> List of 5
#> $ intercept : num 0.189
#> $ beta : num [1:50, 1] -3.971 -2.058 0.942 1.839 4.04 ...
#> $ family : chr "binomial"
#> $ lambda : num 0.00378
#> $ num_non_zero: num 26
#> - attr(*, "class")= chr "gds"
At the moment, only a single value of lambda
is accepted
by GDS. We will fix this in the future, as well as adding a cross
validation function.
When an estimate of the measurement error covariance matrix Σuu is available, a corrected lasso can be defined as minimizing the loss minimize L(β) = ∥y − Wβ∥2 − βTΣuuβ + λ∥β∥1 subject to ∥β∥1 ≤ R. Because we subtract the positive semidefinite matrix Σuu from the convex function ∥y − Wβ∥2, this corrected lasso may be non-convex. In fact, when p > n is is always non-convex. Hence, in order to avoid non-trivial solutions, we must constrain the solution to lie in some L1-ball with radius R. Since this problem involves two regularization parameters, λ and R, it is more convenient to use the constrained version of the lasso
minimize L(β) = ∥y − Wβ∥2 − βTΣuuβ subject to ∥β∥1 ≤ R. Loh and Wainwright (2012) analyze the properties of these two closely related versions of the lasso. They show that the bounds for the estimates of ∥β̂ − β∥22 and ∥β̂ − β∥1 are of the same order as for the standard lasso without measurement error, where β̂ is the global minimum of the optimization problem. More remarkably, they show that despite non-convexity, under mild assumptions a projected gradient descent algorithm will converge with high probability to a local optimum which is very close to the global optimum.
Sorensen, Frigessi, and Thoresen (2015) analyzed the covariate selection properties of the same model, and similarly showed that results very similar to those for covariate selection with the standard lasso in the absence of measurement, also hold for this corrected lasso.
This package implements the projected gradient descent algorithm
proposed by Loh and Wainwright (2012). It
can be found in the corrected_lasso
function. Using the
create_example_data
function defined above, we illustrate
its use.
set.seed(1000)
# Generate example data
ll <- create_example_data(n, p)
# Fit the corrected lasso
corrected_fit <- corrected_lasso(W = ll$W, y = ll$y, sigmaUU = ll$sigmaUU)
The object returned is a list
with class
corrected_lasso
.
# Class of the object
class(corrected_fit)
#> [1] "corrected_lasso"
# The coef() method prints the number of nonzero estimates as a function of the radius
coef(corrected_fit)
#> Number of nonzero coefficient estimates
#> as a function of regularization parameter (radius):
#> radius nonzeros
#> 0.01006214 1
#> 0.53911872 2
#> 1.06817530 2
#> 1.59723188 2
#> 2.12628846 2
#> 2.65534505 2
#> 3.18440163 4
#> 3.71345821 4
#> 4.24251479 4
#> 4.77157137 4
#> 5.30062795 4
#> 5.82968454 5
#> 6.35874112 5
#> 6.88779770 5
#> 7.41685428 6
#> 7.94591086 8
#> 8.47496744 14
#> 9.00402403 17
#> 9.53308061 22
#> 10.06213719 24
The arguments to the function are shown below.
args(corrected_lasso)
#> function (W, y, sigmaUU, family = c("gaussian", "binomial", "poisson"),
#> radii = NULL, no_radii = NULL, alpha = 0.1, maxits = 5000,
#> tol = 1e-12)
#> NULL
If the radii
argument is not set, a naive lasso is run
on the measurement error data, using cross-validation to find β̂naive.
The maximum of R is set to
Rmax = 2∥β̂naive∥1,
i.e., the maximum possible solution to the corrected lasso is twice as
large as the naive solution, as measured by the L1-norm. The minimum of
R is by default set to Rmin = 10−3Rmax.
The corrected lasso solution is then computed on an equally spaced grid
between Rmin
and Rmax.
The length of the grid is set by the no_radii
argument,
which by default equals 20.
The resulting estimates can be visualized using the plot
function for objects of class corrected_lasso
. Calling the
function with no additional arguments returns a plot of the number of
nonzero coefficients for each value of the constraint radius. This is
equivalent to calling
plot(corrected_fit, type = "nonzero")
.
Instead using the additional argument type = "path"
yields the full coefficient paths for the estimates for all values of
the radius.
Sorensen, Frigessi, and Thoresen (2015) show how we can also correct for measurement error in the lasso for generalized linear models, using the conditional score method introduced by Stefanski and Carroll (1987), combined with the projected gradient descent algorithm used by Loh and Wainwright (2012) for the corrected lasso for linear models.
The corrected lasso for logistic regression is implemented in
hdme
. To illustrate its use, we start by generating some
measurement error data with binomially distributed response.
We get logistic regression by specifying
family = "binomial"
in the call the
corrected_lasso()
.
The plot functions work the same way as for linear regression. By
default, the argument type = "nonzero"
.
Setting type = "path"
, we get the coefficient values
along the regularization parameter R.
Poisson regression is also available, by using the argument
family = "poisson"
to corrected_lasso
.
The corrected lasso for linear regression has a clearly defined loss
function, L(β) = ∥y − Wβ∥22 − βTΣuuβ.
In order to find the optimal value of the regularization parameter R, we can hence use cross-validation
to minimize L(β).
This is implemented in the function cv_corrected_lasso
. We
illustrate its use below.
For generalized linear models, we are using the conditional score approach, which does not yield a well defined loss function. Hence, cross-validation is not straightforward in these cases.
set.seed(1000)
# Generate example data
ll <- create_example_data(n, p)
# Run lasso with cross-validation
cv_corrected_fit <- cv_corrected_lasso(W = ll$W, y = ll$y, sigmaUU = ll$sigmaUU)
The result is a list of class cv_corrected_lasso
.
The print below shows all the elements of the resulting list.
str(cv_corrected_fit)
#> List of 6
#> $ cv :'data.frame': 100 obs. of 5 variables:
#> ..$ radii : num [1:100] 0.0119 0.1324 0.2528 0.3732 0.4937 ...
#> ..$ mean_loss: num [1:100] 8.74 8.48 8.22 7.97 7.7 ...
#> ..$ sd_loss : num [1:100] 9.07 8.74 8.42 8.1 7.8 ...
#> ..$ upper_1se: num [1:100] 11.6 11.2 10.9 10.5 10.2 ...
#> ..$ lower_1se: num [1:100] 5.88 5.71 5.56 5.41 5.23 ...
#> $ radius_min: num 9.28
#> $ loss_min : num -4.11
#> $ radius_1se: num 7.48
#> $ loss_1se : num -3.79
#> $ family : chr "gaussian"
#> - attr(*, "class")= chr "cv_corrected_lasso"
The element cv
contains all the details of the
cross-validation runs.
radii
contains all the constraint radii R that were used.mean_loss
contains the mean of the loss function at the
given radius over all folds.sd_loss
gives the standard deviation of the loss at the
given radius.upper_1se
and lower_1se
contain the upper
and lower standard error of mean_loss
at the given
radius.Next, loss_min
gives the minimum of the mean loss, and
radius_min
is the corresponding radius.
loss_1se
gives the smallest loss within one standard error
of loss_min
, and radius_1se
is a smallest
radius given a loss less than or equal to loss_1se
.
The result of performing cross-validation can be illustrated using
the plot
function. It shows the cross-validated loss over
the grid of radii. Rmin
and R1se
are shown with labels, and the corresponding loss as horizontal red
lines.
Having used cv_corrected_lasso
to find the right value
of the constraint parameter R,
we can use corrected_lasso
on this single value. The
snippet below shows how to compute the solution using R1se.
The final parameter estimates can be found in
corrected_fit$betaCorr
.
The Matrix Uncertainty Selector (MUS) was introduced by Rosenbaum and Tsybakov (2010) as a modification of the Dantzig Selector for data with measurement error. The key insight behind the MUS, is that when X is measured with error, the true coefficient vector β may not be part of the feasible set, even when λ is set to its theoretically optimal value. The reason is that λ is a bound on the residual of the linear model, while in the case of measurement error, a bound on δ the measurement error matrix U is also needed. The MUS is defined as the optimization problem minimize ∥β∥1, subject to (1/n)∥WT(y − Wβ)∥∞ ≤ λ + δ∥β∥1. Note that the MUS does not require an estimate of the measurement error covariance matrix Σuu. This might be a practical advantage in some cases, when an estimate of Σuu is hard to obtain.
The MUS can be converted to a linear programming problem (Sorensen et al. (2018)). The hdme
package contains a function mus
for computing the Matrix
Uncertainty Selector. It uses Rglpk
(Theussl and Hornik (2019)) for solving the
underlying linear program. In order to illustrate mus
, we
generate some example data with measurement error.
set.seed(1)
# Number of samples
n <- 1000
# Number of covariates
p <- 50
# Generate data
ll <- create_example_data(n, p, sdU = 0.2)
We provide the measurement matrix and the response. The solution is
computed over a grid of δ
values, and with λ set to the
cross-validated optimum for the lasso, as chosen by
cv.glmnet
in the glmnet
package. The returned
object is a list with class gmus
, which stands for
generalized matrix uncertainty selector.
The coef() methods shows the number of nonzero coefficients as a function of δ.
coef(mus_fit)
#> Number of nonzero coefficient estimates
#> as a function of regularization parameters
#> (lambda, delta):
#> lambda delta nonzeros
#> 0.026 0.00 29
#> 0.026 0.02 5
#> 0.026 0.04 14
#> 0.026 0.06 17
#> 0.026 0.08 12
#> 0.026 0.10 7
#> 0.026 0.12 6
#> 0.026 0.14 5
#> 0.026 0.16 4
#> 0.026 0.18 4
#> 0.026 0.20 4
#> 0.026 0.22 4
#> 0.026 0.24 4
#> 0.026 0.26 4
#> 0.026 0.28 4
#> 0.026 0.30 4
#> 0.026 0.32 4
#> 0.026 0.34 4
#> 0.026 0.36 4
#> 0.026 0.38 4
#> 0.026 0.40 4
#> 0.026 0.42 4
#> 0.026 0.44 4
#> 0.026 0.46 3
#> 0.026 0.48 3
#> 0.026 0.50 3
The default plot method shows the number of nonzero coefficients along the grid of δ values chosen by the algorithm.
According to the “elbow rule” (Rosenbaum and
Tsybakov (2010)), a final value of δ can be chosen where the curve
starts to level off. This is not always easy in practice, but in the
plot above, a value between 0.05 and
0.1 may be reasonable. Choosing 1.0, we can call the algorithm again, but
this time setting the argument delta = 0.1
.
Since only one value of δ was provided, the plot method will now show all the estimated coefficients, rather than the number of nonzero coefficients, as in the previous plot.
The Generalized Matrix Uncertainty Selector (GMUS) is an extension of the MUS to generalized linear models, introduced by Sorensen et al. (2018). It is defined as the solution to the optimization problem
$$ \text{minimize } ~ \|\beta\|_{1}, ~ \text{ subject to } ~ (1/n)\|W^{T} (y - \mu(W \beta))\|_{\infty} \leq \lambda + \sum_{r=1}^{R}\delta^{r}(r! \sqrt{n})^{-1} \|\beta\|_{1}^{r} \|\mu^ {(r)}(W\beta)\|_{2}, $$ where μ(⋅) is the vector valued mean function of the generalized linear model and μ(r)(⋅) is its rth derivative. R is a parameter controlling the number of Taylor expansion terms which are included. When R → ∞, the true solution is a member of the feasible set given that the bounds (1/n)∥WTϵ∥∞ ≤ λ and ∥U∥∞ ≤ δ hold. In practice, we set R = 1 for computational reasons. When R = 1, the GMUS can be solved using a sequence of linear programming problems on the same form as the MUS, with an iterative reweighing algorithm.
The gmus
function in the hdme
package
implements the GMUS for R = 1,
which is defined as minimize
∥β∥1, subject to
(1/n)∥WT(y − μ(Wβ))∥∞ ≤ λ + δn−(1/2)∥β∥1∥μ(1)(Wβ)∥2.
As the MUS, the GMUS does not require an estimate of Σuu.
We illustrate gmus
using logistic regression. First, we
generate some sample data.
set.seed(323)
n <- 100
p <- 50
ll <- create_example_data(n, p, sdU = 0.2, family = "binomial")
gmus_fit <- gmus(ll$W, ll$y, family = "binomial")
The returned object is again a list with class gmus
.
class(gmus_fit)
#> [1] "gmus"
str(gmus_fit)
#> List of 6
#> $ intercept : num [1:26] -0.587 -0.322 -0.257 -0.225 -0.204 ...
#> $ beta : num [1:50, 1:26] -3.223 -1.406 0.885 1.696 2.729 ...
#> $ family : chr "binomial"
#> $ delta : num [1:26] 0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 ...
#> $ lambda : num 0.00926
#> $ num_non_zero: num [1:26] 29 14 10 8 7 5 5 5 5 5 ...
#> - attr(*, "class")= chr "gmus"
The model fitting works the same way as for the MUS. A λ value is found by running the
lasso with the appropriate link function, and taking the value yielding
minimum cross-validation loss. A range of δ values are tried. We can call the
plot
function to study the behavior.
Again we see that the number of nonzero coefficients decreases in
δ. Following the elbow rule, a
reasonable choice for δ might
here be 0.1. To obtain a final
estimate, we might therefore run gmus
again with this
value.
The plot will now show the estimated coefficients.
The Generalized Matrix Uncertainty Selector is also implemented for
Poisson regression. Use the argument family = "poisson"
to
gmus
.
A “lasso equivalent” of the Generalized Matrix Uncertainty Selector can also be defined (Rosenbaum and Tsybakov (2010), Sorensen et al. (2018)). We refer to Sorensen et al. (2018) for the details of this algorithm.
This GMU Lasso is implemented in hdme
, and can be called
with the function gmu_lasso
. The snippet below shows its
use. The underlying coordinate descent algorithm is implemented in
C++
using the RcppArmadillo
package (Eddelbuettel and Sanderson (2014)). Both
logistic and Poisson regression are supported, with arguments
family = "binomial"
and family = "poisson"
,
respectively.
set.seed(323)
n <- 100
p <- 50
ll <- create_example_data(n, p, sdU = 0.2, family = "binomial")
gmu_lasso_fit <- gmu_lasso(ll$W, ll$y, family = "binomial")
The returned object is a list with class gmu_lasso
.
class(gmu_lasso_fit)
#> [1] "gmu_lasso"
str(gmu_lasso_fit)
#> List of 6
#> $ intercept : num [1:26] -0.591 -0.378 -0.304 -0.266 -0.242 ...
#> $ beta : num [1:50, 1:26] -3.325 -1.382 0.897 1.769 2.799 ...
#> $ family : chr "binomial"
#> $ delta : num [1:26] 0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 ...
#> $ lambda : num 0.00926
#> $ num_non_zero: num [1:26] 28 24 15 13 11 9 9 7 7 7 ...
#> - attr(*, "class")= chr "gmu_lasso"
The plotting function gives an elbow plot, which can be used to
select the regularization parameter delta
.