Bioinformatics
Oxford University Press
image
selectBoost: a general algorithm to enhance the performance of variable selection methods
DOI 10.1093/bioinformatics/btaa855 , Volume: 37 , Issue: 5 , Pages: 659-668
Article Type: research-article, Article History
Abstract

Motivation

With the growth of big data, variable selection has become one of the critical challenges in statistics. Although many methods have been proposed in the literature, their performance in terms of recall (sensitivity) and precision (predictive positive value) is limited in a context where the number of variables by far exceeds the number of observations or in a highly correlated setting.

Results

In this article, we propose a general algorithm, which improves the precision of any existing variable selection method. This algorithm is based on highly intensive simulations and takes into account the correlation structure of the data. Our algorithm can either produce a confidence index for variable selection or be used in an experimental design planning perspective. We demonstrate the performance of our algorithm on both simulated and real data. We then apply it in two different ways to improve biological network reverse-engineering.

Availability and implementation

Code is available as the SelectBoost package on the CRAN, https://cran.r-project.org/package=SelectBoost. Some network reverse-engineering functionalities are available in the Patterns CRAN package, https://cran.r-project.org/package=Patterns.

Supplementary information

Supplementary data are available at Bioinformatics online.

Bertrand, Aouadi, Jung, Carapito, Vallat, Bahram, Maumy-Bertrand, and Lenore: selectBoost: a general algorithm to enhance the performance of variable selection methods

1 Introduction

Technological innovations make it possible to measure large amounts of data in a single observation. As a consequence, problems in which the number P of variables is larger than the number N of observations have become common. As reviewed by Fan and Li (2006), such situations arise in many fields from fundamental sciences to social science, and variable selection is required to tackle these issues. For example, in biology/medicine, thousands of messenger RNA (mRNA) expressions (Lipshutz et al., 1999) may be potential predictors of some disease. Moreover, in such studies, the correlation between variables is often very strong (Segal et al., 2003), and variable selection methods often fail to make the distinction between the informative variables and those which are not. Similarly, inference of gene regulatory networks from perturbation data can enhance the insights of a biological system (Morgan et al., 2019). In this article, we propose a general algorithm that enhances model selection in correlated variables.

First, we will assume a statistical model with a response variable y=(y1,,yN) (with the symbol ‘’’ as the transposed), a variable matrix of size N × P, X=(x1.,,xP.) and a vector of parameters β=(β1,,βP). Then, we will assume that the vector of parameters β=(β1,,βP) is sparse. In other words, we will assume that βi=0 except for a quite small proportion of elements of the vector. We note S as the set of indices for which βi0 and q< is the cardinality of this set S. Without any loss of generality, we will assume that βp0 if and only if pq.

When dealing with a problem of variable selection, one of the goals is the estimation of the support, in which you want (S=S^) to be close to one, with S^={k:β^k0}. Here, our interest is mainly as follows, i.e. in identifying the correct support S. This kind of issue arises in many fields, e.g. in biology, where it is of greatest interest to discover which specific molecules are involved in a disease (Fan and Li, 2006).

There is a vast literature dealing with the problem of variable selection in both statistical and machine-learning areas (Fan and Li, 2006; Fan and Lv, 2010). The main variable selection methods can be gathered in the common framework of penalized likelihood. The estimate β^ is then given by:

β^=argminβP[N(β)+p=1Ppenλ(βp)],
where N(.) is the log-likelihood function, penλ(.) is a penalty function and λ is the regularization parameter. As the goal is to obtain a sparse estimation of the vector of parameters β, a natural choice for the penalty function is to use the so-called 0 norm (.0), which corresponds to the number of non-vanishing elements of a vector:
penλ:{0,λ}x{penλ(x)=λif x0penλ(x)=0else,
which induces p=1Ppenλ(βp)=λβ0. For example, when λ  =1, we get the Akaike Information Criterion (AIC) (Akaike, 1974) and when λ=log(N)2, we get the Bayesian Information Criterion (BIC) (Schwarz, 1978).

Many different penalties can be found in the literature. Solving this problem with .0 as part of the penalty is an NP-hard problem (Fan and Lv, 2010; Natarajan, 1995). It cannot be used in practice when P becomes large, even when it is employed with some search strategy like forward regression, stepwise regression (Hocking, 1976) and genetic algorithms (Koza et al., 1999). Donoho and Elad (2003) showed that relaxing .0 to norm .1 ends, under some assumptions, to the same estimation. This result encourages the use of a wide range of penalties based on different norms. For example, the case where penλ(βp)=λ|βp| is the lasso estimator (Tibshirani, 1996) [or equivalently Basis Pursuit Denoising (Chen et al., 2001)] whereas penλ(βp)=λβp2 leads to the Ridge estimator (Hoerl and Kennard, 1970). Nevertheless, the penalty term induces variable selection only if:

minx0(dpenλ(x)dx+x)>0.

Equation (3) explains why the lasso regression allows for variable selection, while the Ridge regression does not. The lasso regression is, however, known to lead to a biased estimate (Zou, 2006). The Smoothly Clipped Absolute Deviation (SCAD) (Fan, 1997), Minimax Concave Penalty (Zhang, 2010) or adaptive lasso (Zou, 2006) penalties all address this problem. The popularity of such variable selection methods is linked to fast algorithms like Least-Angle Regression Selection (Efron et al., 2004), coordinate descent or Penalized Linear Unbiased Selection (Zhang, 2010).

Nevertheless, the goal of identifying the correct support of the regression is complicated and the reason why variable selection methods fail to select the set of non-zero variables S can be summarized in two words: linear correlation. Choosing the lasso regression as a special case, Zhao and Yu (2006) stated that if an irrelevant predictor is highly correlated with the predictors in the true model, lasso may not be able to distinguish it from the true predictors with any amount of data and any amount of regularization. Zhao and Yu (2006) [and simultaneously Zou (2006)] found an almost necessary and sufficient condition for lasso sign consistency (i.e. selecting the non-zero variables with the correct sign). This condition is known as ‘irrepresentable condition’:

|XSXS(XSXS)1sgn(βS)|<1,
where XS=(xij)i,jS2,XS=(xij)i,jS2,βS=(βp)pS. In other words, when sgn(βS)=1, this can be seen as the regression of each variable, which is not in S over the variables, which are in S. As all variables in the matrix X are centered, the absolute sum of the regression parameters should be smaller than 1 to satisfy this ‘irrepresentable condition’.

Facing this issue, existing variable selection methods can be split into two categories:

  • those which are ‘regularized’ and try to give similar coefficients to correlated variables [e.g. elastic net (Zou and Hastie, 2005)],
  • those which are not ‘regularized’ and pick up one variable among a set of correlated variables [e.g. the lasso (Tibshirani, 1996)].

The former group can further be split into methods in which groups of correlation are known, such as the group lasso (Friedman et al., 2010a; Yuan and Lin, 2006) and those in which groups are not known as in the elastic net (Zou and Hastie, 2005). The latter combines the 1 and the 2 norm and takes advantage of both. Non-regularized methods will select some co-variables among a group of correlated variables while regularized methods will select all variables in the same group with similar coefficients.

The main idea of our algorithm is to consider that any observed value of a group of linearly correlated variables of the X matrix is the independent realization of a given random function. This common random function is then used to perturb the observed values of the relevant correlated variables. Strictly speaking, the use of noise to determine the informative variables is not a new idea. For example, it has been shown that adding random pseudo-variables decreases over-fitting (Wu et al., 2007). In the case where P > N, the pseudo-variables are generated either with a standard normal distribution N(0,1) or by using permutations on the matrix X (Wu et al., 2007). Another approach consists of adding noise to the response variable and leads to similar results (Luo et al., 2006). The rationale of this method is based on the work of Cook and Stefanski (1994), which introduces the simulation-based algorithm SIMEX (Cook and Stefanski, 1994). Adding noise to the matrix X has already been used in the context of microarrays (Chen et al., 2007). Simsel (Eklund and Zwanzig, 2012) is an algorithm that both adds noise to variables and uses random pseudo-variables. One new and inspiring approach is stability selection (Meinshausen and Bühlmann, 2010) in which the variable selection method is applied on sub-samples, and informative variables are defined as variables which have a high probability of being selected. Bootstrapping has been applied to the lasso on both the response variable and the matrix X with better results in the former case (Bach et al., 2008). A random lasso, in which variables are weighted with random weights, has also been introduced (Wang et al., 2011).

In this article, following the idea of using simulation to enhance the variable selection methods, we propose the SelectBoost algorithm. Unlike other algorithms reviewed above, it takes into account the correlation structure of the data. Furthermore, our algorithm is motivated by the fact that in the case of non-regularized variable selection methods, if a group contains variables that are highly correlated together, one of them will be chosen with precision.

2 Materials and methods

The SelectBoost algorithm has been designed in a general framework in order to avoid to select non-predictive correlated features. The main goal is to improve the predictive positive value (PPV), i.e. the proportion of selected variables which truly belong to S.

2.1 Generate new perturbed design matrix

As we assume that the variables are centered and that xp.2=1 for p=1,,P, we know that xp.SN2. Indeed, the normalization puts the variables on the unit sphere SN1. The process of centering can be seen as a projection on the hyperplane HN1 with the unit vector as normal vector. Moreover, the intersection between HN1 and SN1 is SN2. We further define the following isomorphism:

ϕ:HN1N1hnϕ(hn)=fnn=1,,N1,
where {hn}n=1,,N1 is an orthogonal base of HN1 and {fn}n=1,,N1 is the canonical base of N1. We define:
hn=i=1neinen+1i=1neinen+1,
with {en}n=1,,N the canonical base of N. Note that ϕ(SN2)=SN2, and that is why we can work in N1 and then return in N.

Here, we make the assumption that a group of correlated variables are independent realizations of the same multivariate Gaussian distribution. As the variables are normalized with respect to the 2 norm, we will use the von Mises–Fisher distribution (Sra, 2012) in N1 thanks to the isomorphism ϕ in order to generate new perturbed design matrix. The probability density function of the von Mises–Fisher distribution for the random P-dimensional unit vector x is given by:

fP(x;μ,κ)=K˜P(κ)exp(κμx),
where κ0,μ=(μ1,,μP),μ2=1, and the normalization constant K˜P(κ) is equal to:
K˜P(κ)=κP/21(2π)P/2IP/21(κ),
where Iv denotes the modified Bessel function of the first kind and order v (Abramowitz and Stegun, 1972). We denote by μ^ and κ^ the maximum likelihood estimators of the μ and κ parameters.

The multivariate Gaussian distribution assumption is not restrictive. As long as the group of correlated variables is independent realizations of the same distribution, the SelectBoost algorithm can be applied: either directly to assess the stability of the selected variables with perturbed datasets with an increasing noise level, which is the core idea behind the SelectBoost algorithm, or after replacing the von Mises–Fisher distribution with a more relevant one.

2.2 The SelectBoost algorithm

To use the SelectBoost algorithm, we need a grouping method grc0 depending on a user-provided constant 0c01. This constant determines the strength of the grouping effect. The grouping method maps each variable index 1,,P to an element of P({1,,P}) [with P(S) the powerset of the set S, i.e. the set which contains all the subsets of S]. Concretely, grc0(p) is the set of all variables, which are considered to be linked to the variable xp and Xgrc0(p) is the submatrix of X containing the columns which indices are in grc0(p). We impose the following constraints to the grouping function:

p{1,P}:gr1(p)={p}  and  gr0(p)={1,P}.

Furthermore, we need to have a selection method:

select:N×P×N{0,1}P,
which maps the design matrix X and the response variable y to a 0–1 vector of length P with 1 at position p if the method selects the variable p and 0 otherwise. We then use the von Mises–Fisher distribution to generate replacement of the original variables by some simulations (see Algorithm 1) to create B new design matrices X(1),,X(B). The SelectBoost algorithm then applies the variable selection method select to each of these matrices and returns a vector of length P with the frequency of apparition of each variable. The frequency of apparition of variable xp., noted ζp is assumed to be an estimator of the probability (xp.S) for this variable to be in S. The choice of c0 is crucial. On the one hand, when this constant is too large, the model is not perturbed enough. On the other hand, when this constant is too small, variables are chosen at random.

The SelectBoost algorithm returns the vector ζ=(ζ1,,ζP). Each of these values has to be compared to a threshold ζmin to determine which variables are selected: we choose to select a variable p if ζpζmin. The simulation study showed that the choice of the threshold is critical and the algorithm can be improved if we enforce that the ζp values—as functions of c0 —are non-increasing, see Figure 1 bottom. This additional requirement makes sense: the more variables the resampling process involves—with smaller c0— the less a given variable will be selected.

Top: evolution of the recall, PPV and F-score as a function of 1−c0 for LASSO-based SelectBoost and AICc model selection criterion for Type1 simulated data with a non-increasing post-processing step and a threshold ζmin=1. If c0≤0.25 models are empty. Bottom: the distribution of the PPV for a 0.25 threshold and c0=mean(q90,q100) for SPLS-based SelectBoost, Type1 data and raw SelectBoost (left) or SelectBoost with a non-increasing post-processing step (right)
Fig. 1.
Top: evolution of the recall, PPV and F-score as a function of 1c0 for LASSO-based SelectBoost and AICc model selection criterion for Type1 simulated data with a non-increasing post-processing step and a threshold ζmin=1. If c00.25 models are empty. Bottom: the distribution of the PPV for a 0.25 threshold and c0=mean(q90,q100) for SPLS-based SelectBoost, Type1 data and raw SelectBoost (left) or SelectBoost with a non-increasing post-processing step (right)

2.3 Choosing the parameters of the algorithm

We first have to choose the grouping function. One of the simplest ways to define a grouping function grc0 is the following:

grc0(p)={q{1,,P}||<xp.,xq.>|c0}.

In other words, the correlation group of the variable p is determined by variables whose correlation with xp. is at least c0. In another way, the structure of correlation may further be taken into account using graph community clustering. Let C be the correlation matrix of matrix X. Let define Cˇ as follows:

cˇij={|cˇij|if|cˇij|>c0  and  ij0otherwise.

Then, we apply a community clustering algorithm on the undirected network with weighted adjacency matrix defined by Cˇ. Using a graph community clustering algorithm is helpful with large datasets while still clustering similar variables together. For instance, the fast greedy modularity optimization algorithm for finding community structure (Clauset et al., 2004) runs in essentially linear time for many real-world networks given that they are sparse and hierarchical.

Once the grouping function is chosen, we have to choose parameter c0 . Due to the constraints in Equation (6), the SelectBoost algorithm results in the initial variable selection method when c0=1. As we will show in the next section, the smaller the parameter c0, the higher the precision of the resulting selected variables. On the other hand, it is obvious that the probability of choosing none of the variables (i.e. resulting in the choice of an empty set) increases as the parameter c0 decreases. In the perspective of experimental planning, the choice of c0 should result of a compromise between precision and proportion of active identified variables. Hence, the c0 parameter can be used to introduce a confidence index γp related to the variable xp.:

γp=1minxp.S^c0c0,hence0γp1.

Algorithm 1 Pseudo-code for the SelectBoost algorithm

Require: grc0,select,B,c0 ζ0P

forb=1,,Bdo

  X(b)X

  forp=1,,Pdo

   xp.(b)ϕ1(random-vMF(μ^(ϕ(Xgrc0(p))),κ^(ϕ(Xgrc0(p))))

  end for

  ζζ+select(X(b),y)

end for

ζζ/B

3 Numerical studies

We benchmarked the algorithm with a large simulation study with four data generation processes and three real datasets (Table 1). Generated datasets are available upon request and real datasets are available either upon request or online.

    • Simulation with 1000 variables and one linear response. A cluster of 50 variables is linked to the response.
    • Simulation with 1000 variables and one binary response. A cluster of 50 variables is linked to the response.
    • Data are 200 uncorrelated (‘unlinked’) single nucleotide polymorphisms (SNPs) with simulated genotypes, in which the first 20 of them affect the outcome with three covariates; 400 observations;
    • Data are 100 uncorrelated (‘unlinked’) SNPs with simulated genotypes, in which the first 10 of them affect the outcome with two covariates; 750 observations.
    • The Huntington dataset is a real dataset with 28 087 variables observed on 69 individuals. We first applied independent filtering and removed 10 370 variables. We applied the SelectBoost algorithm to 17 717 variables observed on 69 individuals.
Table 1.
Summary of the types of datasets used to benchmark the SelectBoost algorithm
NameDataIndividualsVariables
Type1Simulated1001000
Type2Simulated1001000
Type3Simulated400203
Type4Simulated750102
LeukemiaObserved723571
HuntingtonObserved6917 717
MelanomaObserved2825 268

For Types 1 and 2, the number of variables is 1000, and the number of observations is 100. The data are generated from a cluster simulation (Bair et al., 2006; Bastien et al., 2015). Only 50 first predictors are linked to the response Y and the last 950 variables are randomly generated from a standard normal distribution. For Example 3, the response variable is linear but was turned into a binary variable (+1 when Yi>0 and –1 when Yi<0).

Examples 1 and 3 are linear regression examples whereas 2, 4, 5, 6 and 7 are logistic regression ones, for which, we will assume a logistic model with a binary response variable (Peng et al., 2002).

We provide results for 12 different settings based on 10 different models, see Supplementary Section S1 for more details.

    • Logistic regression (five types): logistic LASSO (glmnet based) with model choice based on 5-fold cross-validation, logistic LASSO (glmnet based) with model choice based on information criteria (AICc, BIC), varbvs binomial, SPLSda (Chun and Keles, 2010) and sgpls (Chun and Keles, 2010).

The SelectBoost algorithm is based on correlated resampling and hence random. We wanted to assess both the stability and performance of the algorithm. As a consequence, for the four types of simulated data, we focused both on what may be called a repeatability study (a given dataset was analyzed 100 times to estimate the variation only due to the fact that the algorithm is random) and a reproducibility study (100 different datasets were generated and analyzed to estimate the variability due to both data simulation—from the same data generation—and the fact that the algorithm is random).

The repeatability issue raised was raised, for instance by Boulesteix (2014) and Magnanensi et al. (2017) for PLS models. For those models, random split cross-validation is known to have poor repeatability. We used two types of grouping functions (either determined by variables whose correlation with xp is at least c0 -gdirect- or community clustering-based -gcc-).

The cost (memory and time) of the random generation step can be limited thanks to a sparse correlated resampling feature. The remaining cost of the algorithm is B×Nc0×Time1 with B the number of resampling and Nc0 the number of c0 values that are investigated and Time1 the time to fit the model once.

To demonstrate the performance of the SelectBoost method, we compared our method with stability selection (Meinshausen and Bühlmann, 2010) and with a naive version of our algorithm, naiveSelectBoost. The naiveSelectBoost algorithm works as follows: estimate β with any variable selection method then if grc0(p), as defined in Equation (7) e.g. is not reduced to p, shrink to 0. The naiveSelectBoost algorithm is similar to the SelectBoost algorithm, except that it does not take into account the error, which is made choosing at random a variable among a set of correlated variables.

We use four indicators to evaluate the abilities of our method on simulated data. We define:

  • recall as the ratio of the number of correctly identified variables (i.e. β^i0 and βi0) over the number of variables that should have been discovered (i.e. βi0).
  • precision as the ratio of correctly identified variables (i.e. β^i0 and βi0) over the number of identified variables (i.e. β^i0).
  • F-score as the following ratio:
    2×recall×precisionrecall+precision·
  • selection as the average number of identified variables (i.e. β^i0).

Note that our interest is focused on precision, as our goal is to select reliable variables. As stated before, when c0 is decreasing toward zero, we expect a profit in precision and a decrease in recall. We also compute the F-score, which combines both recall and precision. As an improvement of precision comes with a decrease in the number of identified variables, the best method is the one with the highest precision for a given level of selection.

3.1 Results of the numerical studies

We show the evolution of the four criteria (recall, precision, F-score and selection) with regards to the decrease of c0. When c0=1, the SelectBoost algorithm is equivalent to the initial variable selection method. We introduce a post-processing step to enforce that, for a given variable, the proportion of selection is non-increasing. It is the expected behavior since the correlated resampling is not meant to increase the probability of selection for a variable. Such an increase may happen for small c0 values when a variable that is not linked with the response is mixed with a variable that is linked to the response. For all the simulations, this post-processing step increases the PPV of the SelectBoost algorithm, see Figure 1. As our primary focus is PPV, we recommend the use of this post-processing step. More details can be found in the Supplementary Graphs S7S174.

We created precision-recall plots to display the effects of the algorithm on the performance of all the models and criteria used for a given dataset. Identical model fitting criteria share the same colors. The arrows point toward decreasing c0 values. Direct grouping and community grouping lead to similar results, Figure 2 and Supplementary Figures S1, S3 and S4. These Figures also show that the results for a single dataset repeated 100 times are similar to results for 100 different datasets. The Zoom l sequence, which is a 10-step regular grid from the q100% quantile—the maximum value—to q90%—the quantile of order 0.9—achieves high PPV, Supplementary Figure S299.

Recall-precision curve. All models and criteria non-increasing SelectBoost. Type 1 data. Direct grouping. A total of 100 different datasets. ζmin=1
Fig. 2.
Recall-precision curve. All models and criteria non-increasing SelectBoost. Type 1 data. Direct grouping. A total of 100 different datasets. ζmin=1

Supplementary Figure S5 displays an example of raw SelectBoost (without the non-increasing post-processing step) for direct grouping and 100 different datasets that should be compared to Figure 2. This effect is even stronger smaller values for the ζmin threshold. The non-increasing post-processing step greatly improves the results of the algorithm and leads to monotonic relationships between the recall, the precision and the c0 value. PPV benefit less from smaller values for the ζmin threshold, Supplementary Figure S6.

All the results of the simulation study showed the good performance and stability of the algorithm, which we then applied once to each of the real datasets. The results of the gdirect and gcc based SelectBoost are similar, the gcc based being a bit more time consuming than the gdirect one. In the following of this article, we reported results and figures for gdirect-based SelectBoost.

According to our simulation studies, Supplementary Graphs S7S174, one should choose c0 between q90% and q100%, see Figure 1 top. In our simulation studies, we used an 11-steps c0 sequence, but, according to our results, it could be limited to 6 steps [from mean(q90%,q100%) to q100%] for the biggest datasets. The value of B should not be lower than 10. B = 50 or B = 100 will provide more stable results. As a consequence, the minimal time cost of the SelectBoost algorithm will be 60 times the time cost of the regular model fit, which could be afforded in almost every case. The parallel processing support of the SelectBoost package can help to reduce this time. Hence, the SelectBoost seems feasible with most of the datasets and even omics datasets as we did in our simulation study with the three real datasets.

Hence, to assess the performance of the SelectBoost algorithm, we performed comprehensive numerical studies. As stated before, the SelectBoost algorithm can be applied to any existing variable selection method.

Figure 1 top shows the result for the lasso selection with a penalty parameter chosen using information criteria for Type 1 datasets. In this example, we improve the precision up to 1. Moreover, as shown by Figures 1 and 3, the proportion of models, for which the precision reaches one, increases with the decrease of c0. The F-score increases, remains either stable or shows a small decrease indicating that the increase of PPV compensates the loss in recall.

Top: The average number of identified variables is plotted as a function of the proportion of correctly identified variables for Type1 simulated data and all models. Middle and bottom: effect of the SelectBoost algorithm wrt 1−c0 for adaptive elastic net and AICc model selection criterion with c0 in the range [q90%;q100%] for 100 different (middle, reproducibility) or 100 identical (bottom, repeatability) Type3 simulated data with a non-increasing post-processing step and a threshold ζmin=1. Only results for non-empty models are shown
Fig. 3.
Top: The average number of identified variables is plotted as a function of the proportion of correctly identified variables for Type1 simulated data and all models. Middle and bottom: effect of the SelectBoost algorithm wrt 1c0 for adaptive elastic net and AICc model selection criterion with c0 in the range [q90%;q100%] for 100 different (middle, reproducibility) or 100 identical (bottom, repeatability) Type3 simulated data with a non-increasing post-processing step and a threshold ζmin=1. Only results for non-empty models are shown

In the previous section, we mentioned the possibility of using SelectBoost to obtain a confidence index, corresponding to one minus the lowest c0 for which a variable is selected. For each c0 , we plotted the average number of selected variables as a function of the proportion of correctly identified variables (Fig. 3 and Supplementary Figs S300S304). As expected, the proportion of correctly identified variables increases with the increase of the confidence index and with the decrease of the average number of identified variables. Therefore, the proportion of non-predictive features decreases with the increase of the confidence index.

The SelectBoost algorithm shows its superiority over the naive SelectBoost algorithm. The error made when choosing a variable randomly among a set of correlated variables leads to more incorrect choices of variables. While the intensive simulation of our algorithm allows taking into account this error, the naiveSelectBoost does not.

Finally, we compare the SelectBoost algorithm with stability selection. Stability selection uses a resampling algorithm to determine which of the variables included in the model are robust. In our simulation, stability selection shows performance with high precision but also low recall. Moreover, in contrast to the SelectBoost algorithm, stability selection does not allow to choose a convenient precision-PPV trade-off.

The timings of the algorithm can be found on Supplementary Figures S175–S222.

4 Application to three real datasets

We applied our algorithm to three real datasets. We studied, with respect to the threshold, the number of non-zero variables, the number of variables selected by SelectBoost and their ratio. We found results that were concordant with those of the simulated datasets. Figure 4 displays those results for a SGPLS-based SelectBoost of the Leukemia dataset with a 0.25 threshold (ζmin=0.25). See Supplementary Figures S247–S298.

% of non-zero coefficients wrt to c0 for SGPLS-based SelectBoost models of the leukemia datasets and threshold ζmin=0.25
Fig. 4.
% of non-zero coefficients wrt to c0 for SGPLS-based SelectBoost models of the leukemia datasets and threshold ζmin=0.25

We report the results for the RNA-Seq dataset providing mRNA expressions from Huntington’s disease and neurologically normal individuals. This dataset was downloaded from the GEO database under accession number GSE64810 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi? acc=GSE64810). This dataset contains 20 Huntington’s disease cases and 49 neurologically normal controls and includes 28 087 genes as explanatory variables. An independent filtering (Bourgon et al., 2010) preprocessing step was first performed using data-based filtering for replicated high-throughput transcriptome sequencing experiments (Rau et al., 2013). Then, we applied the lasso selection method to this reduced dataset (see Fig. 5 left for the whole path of the solution). We used cross-validation to choose the appropriate level of penalization [i.e. the λ parameter in Equation (3)].

Colors: the green is for the most reliable variables selected by the SelectBoost algorithm [confidence index of 0.3; orange is for intermediate confidence (0.25) and red for low confidence (0.15)]. Left: evolution of the coefficients in the lasso regression when the regularization parameter λ is varying. For the λ range shown, the red, orange and green lines stick to zero. Right: evolution of the probability of being in the support of the regression when the confidence index is varying. The dotted line represents the 0.95 threshold
Fig. 5.
Colors: the green is for the most reliable variables selected by the SelectBoost algorithm [confidence index of 0.3; orange is for intermediate confidence (0.25) and red for low confidence (0.15)]. Left: evolution of the coefficients in the lasso regression when the regularization parameter λ is varying. For the λ range shown, the red, orange and green lines stick to zero. Right: evolution of the probability of being in the support of the regression when the confidence index is varying. The dotted line represents the 0.95 threshold

We then applied our SelectBoost algorithm on the lasso method with penalty parameter chosen by cross-validation. We use a range for the c0 parameter starting from 1 to 0.7 with steps of 0.05, which corresponds to a confidence index from 0 to 0.3. For each step, the probability of being included in the support S was calculated with 200 simulations as described in Algorithm 1. We set the threshold of being in the support to 0.95 to avoid numerical instability. We classify the selected variables into three categories: those that are identified for each confidence index from 0 to 0.15 (red), those identified from 0 to 0.25 (orange) and those identified from 0 to 0.3 (green). The last category contains the most reliable variables selected by the SelectBoost algorithm because these variables are identified from low to high confidence index.

With the lasso selection method, 15 variables were selected. Among them, four genes were identified by SelectBoost into the three different categories of confidence index (see Fig. 5 right): two genes for low confidence (red) (ANXA3 and INTS12), one gene for intermediate confidence (orange) (NUB1) and one gene for high confidence (green) (PUS3).

The interesting point, in these three examples, is that the identified variables are neither the first variables selected by the lasso nor the variables with the highest coefficients (see Fig. 5 left). This result demonstrates that our algorithm can be advantageous to select variables with high confidence and not just to select variables with the highest coefficients.

Finally, we decided to assess the differential expression of these genes between patients and controls, using the limma package (Linear Models for Microarray and RNA-Seq Data) (Ritchie et al., 2015). The four identified genes are significantly down-expressed by neurologically healthy controls confirming the result of a logistic model with these four genes.

5 Robust reverse-engineering of networks

Sparsity is a well-known feature of most biological networks (Barabási et al., 2003). An actor can only be regulated by a small number of other actors, whereas it may regulate any number of other actors. Hence, variable selection methods, such as the lasso, ensure that sparsity feature and are often core components of most of the biological network reverse-engineering tools. As a consequence, we propose to apply the SelectBoost algorithm in two different ways in order to improve the biological network reverse-engineering: as a post-processing step after the inference was made or during the inference itself in order to select the most stable predictors for each node in the network. When used as a post-processing step, one can assess for any of the inferred links between the actors of the network, its confidence index against correlated resampling of the predictors. When used during the inference step, one can infer a model that is only built with links with a high enough confidence index. The former is implemented in the SelectBoost package as a new method for the Cascade package (Jung et al, 2014) . The latter is implemented in the new Patterns CRAN package as a dedicated fitting function and is especially useful when trying to find targets for biological intervention that are strongly related to markers of some diseases through the reverse-engineered network and useful and reliable links.

We benchmarked those two uses of the algorithm with a particular type of biological networks that we have been using for several years: cascade networks (Vallat et al., 2013).

For the post-inference processing, we first fit a model to a cascade network using the Cascade package inference function. Then, we compute confidence indices for the inferred links using the SelectBoost algorithm, more details, as well as the code, of the simulations can be found in the vignette of the package ‘Towards Confidence Estimates in Cascade Networks using the SelectBoost Package’, available at https://fbertran.github.io/SelectBoost/articles/confidence-indices-Cascade-networks.html. An example of those results is shown in Figure 6 with a cascade network for four time points and four groups of 25 actors.

Post-inference analysis of an inferred cascade network. Dark values are tantamount to low confidence. Bright values are tantamount to high confidence. Confidence ranges from 0 (lowest) to 1 (highest). The lower triangular part of the matrix is an area with the highest confidence (1) since we know—and assume so in the model—that for cascade networks those links must be =0
Fig. 6.
Post-inference analysis of an inferred cascade network. Dark values are tantamount to low confidence. Bright values are tantamount to high confidence. Confidence ranges from 0 (lowest) to 1 (highest). The lower triangular part of the matrix is an area with the highest confidence (1) since we know—and assume so in the model—that for cascade networks those links must be =0

For the use of the SelectBoost algorithm during the fitting step of a cascade network reverse-engineering, we used the Patterns package. Benchmark results were reported as sensitivity, positive predictive value and F -score, shown in Figure 7; the code, the simulation details and the remaining results are part of a vignette of the package ‘Benchmarking the SelectBoost Package for Network Reverse Engineering’, that is available at https://fbertran.github.io/SelectBoost/articles/benchmarking-SelectBoost-networks.html.


F-score as a function of the thresholding value: if an inferred coefficient for the network is less than the thresholding value, then it is set to 0. The SelectBoost algorithm is compared to both stability selection and the regular lasso. The upper row displays results for the unweighted version of the algorithms, whereas the lower row displays results for their weighted counterparts
Fig. 7.
F-score as a function of the thresholding value: if an inferred coefficient for the network is less than the thresholding value, then it is set to 0. The SelectBoost algorithm is compared to both stability selection and the regular lasso. The upper row displays results for the unweighted version of the algorithms, whereas the lower row displays results for their weighted counterparts

We created an unweighted or a weighted version of the algorithm. The weighted version of the algorithm enables the user to include weights in the model, which means to favor or disfavor some links between the actors, in order, for instance, to take into account biological knowledge.

The results shown in Figure 7 of the simulation study are a comparison to a standard set up for stability selection and regular lasso both for an unweighted version of the algorithms and a highly correctly weighted version of the same algorithms.

By highly correctly weighted, we mean that we included influential weights in the model accordingly to the links that existed in the network that was used for data simulation. This network was randomized from one simulation to another. This weighted setting was used to determine if including correct biological knowledge would help the reverse-engineering algorithm to retrieve the correct network. If correct biological knowledge is included in the model, all three fitting functions lead to similar and outstanding results for the F-score criterion without even requiring the need to search for an optimal thresholding value as we had to do with the Cascade package.

For each simulated dataset, vertical dots are displayed to show the optimal threshold level that should be used to maximize the F-score. It is computed with respect to the actual values that are unknown for real datasets. Without weights, SelectBoost shrinks the range of optimal values when compared to the lasso or stability selection. With correct weights, none of the methods still requires to use a cut-off value to maximize F-score.

In an unweighted setting, the SelectBoost version of the fitting process shows better performance than stability selection and the lasso as long as the cut-off value is <0.4, which is about the double of the optimal thresholding value.

6 Conclusion

We introduce the SelectBoost algorithm that relies intensive computations to select variables with high precision (PPV). The user of SelectBoost can apply this algorithm to produce a confidence index or choose an appropriate precision-selection trade-off to select variables with high confidence and avoid selecting non-predictive features. The main idea behind our algorithm is to take into account the correlation structure of the data and thus use intensive computation to select reliable variables.

The choice of the threshold ζmin is critical since such a choice leads to two effects.

  • With a high threshold value—nearing the maximum value of 1—: an increase of the PPV while limiting the decrease of the F-score.
  • With a low or medium threshold value—nearing the mid value of 0.5—: an increase in recall while limiting the decrease of the F-score.

We will want the first property to retrieve the stable core of the predictors for models that are known to randomly choose between correlated variables, such as the lasso or adaptive lasso. . Whereas, we will want the second property for models that scarcely select variable, such as variable selection model using variational approximation methods for binary response (varbvs). A non-constant threshold should be also and investigated by those that would like to introduce corrections, for instance FDR-like, such as Holm–Bonferroni, in the variable selection process.

We prove the performance of our algorithm through simulation studies in various settings. To get the best results, we recommend the use of c0 in the range of mean(q90%,q100%) to q100% with the non-increasing post-processing step. It could be useful to decrease the lower bound to q90% for the smallest datasets. The user should never use a c0 value too close to the empty model zone to avoid a decrease in precision. We succeed in improving the PPV, whenever it was possible, of all the 12 selection methods with relative stability on recall and F-score. If the PPV was already nearing 1, then there is almost no negative effect on the PPV and recall when applying SelectBoost.

Our results open the perspective of a precision-selection trade-off which may be very useful in some situations where many regressions have to be made (e.g. network reverse-engineering with one regression made per node of the network). In such a context, our algorithm may even be used in an experimental design approach.

The application to three real datasets allowed us to show that the most reliable variables are not necessarily those with the highest coefficients. The SelectBoost algorithm is a powerful tool that can be used in every situation where reliable and robust variable selection has to be made.

Funding

This work was supported by grants from the Agence Nationale de la Recherche (ANR) [ANR-11-LABX-0070_TRANSPLANTEX]; the INSERM [UMR_S 1109]; the Institut Universitaire de France (IUF) and the MSD-Avenir grant AUTOGEN, all to S.B.; the European regional development fund (European Union) INTERREG V program (project number 3.2 TRIDIAG) to R.C. and S.B.; the Agence Nationale de la Recherche (ANR) [ANR-11-LABX-0055_IRMIA]; the CNRS [UMR 7501] to F.B. and M.M.-B.; and by the French HPC Center ROMEO [UR 201923174L] to F.B.

Conflict of Interest: none declared.

References

Abramowitz M. , StegunI.A. (1972) Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. Vol. 55, National Bureau of Standards Applied Mathematics Series, Dover Publications Inc., New York.
Akaike H. (1974) A new look at the statistical model identification. IEEE Trans. Automat. Contr., 19, 716723.
Bach F.R. et al (2008) Bolasso: model consistent lasso estimation through the bootstrap. In: Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland. pp. 3340.
Bair E. et al (2006) Prediction by supervised principal components. J. Am. Stat. Assoc., 101, 119137.
Barabási A.L. et al (2003) Emergence of scaling in complex networks. In: BornholdtS., SchusterH.G. (eds) Handbook of Graphs and Networks: From the Genome to the Internet. Wiley-VCH, Weinheim, pp. 6984.
Bastien P. et al (2015) Deviance residuals-based sparse PLS and sparse kernel PLS regression for censored data. Bioinformatics, 31, 397404.
Boulesteix A.-L. (2014) Accuracy estimation for PLS and related methods via resampling-based procedures. In: PLS–14 Book of Abstracts, Paris, France. pp. 1314.
Bourgon R. et al (2010) Independent filtering increases detection power for high-throughput experiments. Proc. Natl. Acad. Sci. USA, 107, 95469551.
Chen L. et al (2007) Noise-based feature perturbation as a selection method for microarray data. In: Bioinformatics Research and Applications, Atlanta, GA, USA. pp. 237247.
Chen S.S. et al (2001) Atomic decomposition by basis pursuit. SIAM Rev., 43, 129159.
Carbonetto P. , StephensM. (2012) Scalable variational inference for Bayesian variable selection in regression, and its accuracy in genetic association studies. Bayesian Anal., 7, 73108.
Chun H. , KelesS. (2010) Sparse partial least squares regression for simultaneous dimension reduction and variable selection. J. R. Stat. Soc. Series B Stat. Methodol., 72, 325.
Clauset A. et al (2004) Finding community structure in very large networks. Phys. Rev. E, 70, 066111.
Cook J.R. , StefanskiL.A. (1994) Simulation-extrapolation estimation in parametric measurement error models. J. Am. Stat. Assoc., 89, 13141328.
Dettling M. (2004) BagBoosting for tumor classification with gene expression data. Bioinformatics, 20, 35833593.
Donoho D.L. , EladM. (2003) Optimally sparse representation in general (nonorthogonal) dictionaries via L1 minimization. Proc. Natl. Acad. Sci. USA, 100, 21972202.
Efron B. et al (2004) Least angle regression. Ann. Stat., 32, 407499.
Eklund M. , ZwanzigS. (2012) SimSel: a new simulation method for variable selection. J. Stat. Comput. Simul., 82, 515527.
Fan J. (1997) Comments on “Wavelets in statistics: a review” by A. Antoniadis. Stat. Meth. Appl., 6, 131138.
Fan J. , LiR. (2006) Statistical challenges with high dimensionality: feature selection in knowledge discovery. In: Proceedings International Congress of Mathematicitans. Vol. 3, pp. 595622, European Mathematical Society, Zúrich.
Fan J. , LvJ. (2010) A selective overview of variable selection in high dimensional feature space. Stat. Sin., 20, 101148.
Friedman J. et al (2010a) A note on the group lasso and a sparse group lasso. arXiv preprint arXiv: 1001.0736.
Friedman J. et al (2010b) Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw., 33, 122.
Golub T.R. et al (1999) Molecular classification -of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531537.
Guan Y. , StephensM. (2011) Bayesian variable selection regression for genome-wide association studies and other large-scale problems. Ann. Appl. Stat., 5, 17801815.
Hocking R.R. (1976) A Biometrics invited paper. The analysis and selection of variables in linear regression. Biometrics, 32, 149.
Hoerl A.E. , KennardR.W. (1970) Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12, 5567.
Hugo W. et al (2016) Genomic and transcriptomic features of response to anti-PD-1 therapy in metastatic melanoma. Cell, 165, 3544.
Jung N. et al (2014) Cascade: a R package to study, predict and simulate the diffusion of a signal through a temporal gene network. Bioinformatics, 30, 571573.
Koza J.R. et al (1999) Genetic Programming as a Darwinian Invention Machine. Springer, Heidelberg.
Lipshutz R.J. et al (1999) High density synthetic oligonucleotide arrays. Nat. Genet., 21, 2024.
Luo X. et al (2006) Tuning variable selection procedures by adding noise. Technometrics, 48, 165175.
Magnanensi J. et al (2017) A new universal resample-stable bootstrap-based stopping criterion for PLS component construction. Stat. Comput., 27, 757718.
Meinshausen N. , BühlmannP. (2010) Stability selection. J. R. Stat. Soc. Series B Stat. Methodol., 72, 417473.
Morgan D. et al (2019) A generalized framework for controlling FDR in gene regulatory network inference. Bioinformatics, 35, 10261032.
Natarajan B.K. (1995) Sparse approximate solutions to linear systems. SIAM J. Comput., 24, 227234.
Peng C.J. et al (2002) An introduction to logistic regression analysis and reporting. J. Educ. Res., 96, 314.
Rau A. et al (2013) Data-based filtering for replicated high-throughput transcriptome sequencing experiments. Bioinformatics, 29, 21462152.
Ritchie M.E. et al (2015) limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res., 43, e47.
Schwarz G. (1978) Estimating the dimension of a model. Ann. Stat., 6, 461464.
Segal M.R. et al (2003) Regression approaches for microarray data analysis. J. Comput. Biol., 10, 961980.
Sra S. (2012) A short note on parameter approximation for von Mises-Fisher distributions: and a fast implementation of I s (x). Comput. Stat., 27, 177190.
Tibshirani R. (1996) Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Series B Methodol., 58, 267288.
Vallat L. et al (2013) Reverse-engineering the genetic circuitry of a cancer cell with predicted intervention in chronic lymphocytic leukemia. Proc. Natl. Acad. Sci. USA, 110, 459464.
Wang S. et al (2011) Random lasso. Ann. Appl. Stat., 5, 468485.
Wu Y. et al (2007) Controlling variable selection by the addition of pseudovariables. J. Am. Stat. Assoc., 102, 235243.
Yuan M. , LinY. (2006) Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Series B Stat. Methodol., 68, 4967.
Zhang C. (2010) Nearly unbiased variable selection under minimax concave penalty. Ann. Stat., 38, 894942.
Zhao P. , YuB. (2006) On model selection consistency of lasso. J. Mach. Learn. Res., 7, 25412563.
Zhou X. et al (2013) Polygenic modeling with Bayesian sparse linear mixed models. PLoS Genet., 9, e1003264.
Zou H. (2006) The adaptive lasso and its oracle properties. J. Am. Stat. Assoc., 101, 14181429.
Zou H. , HastieT. (2005) Regularization and variable selection via the elastic net. J. R. Stat. Soc. Series B Stat. Methodol., 67, 301320.
https://creativecommons.org/licenses/by-nc/4.0/This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
https://www.researchpad.co/tools/openurl?pubtype=article&doi=10.1093/bioinformatics/btaa855&title=selectBoost: a general algorithm to enhance the performance of variable selection methods&author=&keyword=&subject=Original Papers,Gene Expression,AcademicSubjects/SCI01060,