Technological innovations make it possible to measure large amounts of data in a single observation. As a consequence, problems in which the number P of variables is larger than the number N of observations have become common. As reviewed by Fan and Li (2006), such situations arise in many fields from fundamental sciences to social science, and variable selection is required to tackle these issues. For example, in biology/medicine, thousands of messenger RNA (mRNA) expressions (Lipshutz et al., 1999) may be potential predictors of some disease. Moreover, in such studies, the correlation between variables is often very strong (Segal et al., 2003), and variable selection methods often fail to make the distinction between the informative variables and those that are not. Similarly, inference of gene regulatory networks from perturbation data can enhance insight into a biological system (Morgan et al., 2019). In this article, we propose a general algorithm that enhances model selection in the presence of correlated variables.
First, we will assume a statistical model with a response variable Y (with the symbol ' denoting transposition), a variable matrix X of size N × P and a vector of parameters β*. Then, we will assume that the vector of parameters is sparse. In other words, we will assume that β*_p = 0 except for a quite small proportion of elements of the vector. We note S = {p : β*_p ≠ 0} the set of indices for which β*_p ≠ 0 and S0 = |S| the cardinality of this set. Without any loss of generality, we will assume that β*_p ≠ 0 if and only if p ≤ S0.
When dealing with a problem of variable selection, one of the goals is the estimation of the support: one wants the probability P(Ŝ = S) to be close to one, with Ŝ = {p : β̂_p ≠ 0}. Here, our interest mainly lies in identifying the correct support S. This kind of issue arises in many fields, e.g. in biology, where it is of greatest interest to discover which specific molecules are involved in a disease (Fan and Li, 2006).
There is a vast literature dealing with the problem of variable selection in both statistical and machine-learning areas (Fan and Li, 2006; Fan and Lv, 2010). The main variable selection methods can be gathered in the common framework of penalized likelihood. The estimate is then given by: β̂ = argmin_β {−ℓ(β; Y, X) + pen_λ(β)}, with ℓ the log-likelihood and pen_λ a penalty function depending on a tuning parameter λ.
Many different penalties can be found in the literature. Solving this problem with the ℓ0 norm as part of the penalty is an NP-hard problem (Fan and Lv, 2010; Natarajan, 1995). It cannot be used in practice when P becomes large, even when it is employed with some search strategy like forward regression, stepwise regression (Hocking, 1976) or genetic algorithms (Koza et al., 1999). Donoho and Elad (2003) showed that relaxing the ℓ0 norm to the ℓ1 norm leads, under some assumptions, to the same estimation. This result encourages the use of a wide range of penalties based on different norms. For example, the ℓ1 case is the lasso estimator (Tibshirani, 1996) [or equivalently Basis Pursuit Denoising (Chen et al., 2001)], whereas the ℓ2 case leads to the Ridge estimator (Hoerl and Kennard, 1970). Nevertheless, the penalty term induces variable selection only if it is singular (non-differentiable) at the origin:
Equation (3) explains why the lasso regression allows for variable selection, while the Ridge regression does not. The lasso regression is, however, known to lead to a biased estimate (Zou, 2006). The Smoothly Clipped Absolute Deviation (SCAD) (Fan, 1997), Minimax Concave Penalty (Zhang, 2010) or adaptive lasso (Zou, 2006) penalties all address this problem. The popularity of such variable selection methods is linked to fast algorithms like Least-Angle Regression Selection (Efron et al., 2004), coordinate descent or Penalized Linear Unbiased Selection (Zhang, 2010).
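The contrast between the two penalties can be made concrete under an orthonormal design, where both estimators have closed forms: the lasso soft-thresholds the least-squares coefficient and so produces exact zeros, while Ridge only shrinks it toward zero. A minimal illustration (in Python, for illustration only; function names are ours):

```python
def soft_threshold(z, lam):
    """Lasso solution for one coefficient under an orthonormal design:
    exact zeros appear as soon as |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def ridge_shrink(z, lam):
    """Ridge solution under the same design: shrinks but never reaches zero."""
    return z / (1.0 + lam)

# z is the ordinary least-squares coefficient; lam the penalty level.
print(soft_threshold(0.3, 0.5))  # 0.0 -> variable dropped by the lasso
print(ridge_shrink(0.3, 0.5))    # stays non-zero -> variable kept by Ridge
```

The kink of the ℓ1 penalty at zero is exactly what makes the first function return an exact zero, in line with the singularity-at-the-origin condition above.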
Nevertheless, the goal of identifying the correct support of the regression is complicated and the reason why variable selection methods fail to select the set of non-zero variables can be summarized in two words: linear correlation. Taking the lasso regression as a special case, Zhao and Yu (2006) stated that if an irrelevant predictor is highly correlated with the predictors in the true model, lasso may not be able to distinguish it from the true predictors with any amount of data and any amount of regularization. Zhao and Yu (2006) [and simultaneously Zou (2006)] found an almost necessary and sufficient condition for lasso sign consistency (i.e. selecting the non-zero variables with the correct sign). This condition is known as the ‘irrepresentable condition’: writing C = X'X/N and partitioning it according to S and its complement, it requires that |C_{S^c S} (C_{S S})^{-1} sign(β*_S)| ≤ 1 − η elementwise, for some η > 0.
Facing this issue, existing variable selection methods can be split into two categories:
The former group can further be split into methods in which groups of correlation are known, such as the group lasso (Friedman et al., 2010a; Yuan and Lin, 2006) and those in which groups are not known as in the elastic net (Zou and Hastie, 2005). The latter combines the ℓ1 and the ℓ2 norms and takes advantage of both. Non-regularized methods will select some co-variables among a group of correlated variables while regularized methods will select all variables in the same group with similar coefficients.
The main idea of our algorithm is to consider that any observed value of a group of linearly correlated variables of the X matrix is an independent realization of a given random function. This common random function is then used to perturb the observed values of the relevant correlated variables. Strictly speaking, the use of noise to determine the informative variables is not a new idea. For example, it has been shown that adding random pseudo-variables decreases over-fitting (Wu et al., 2007). In the case where P > N, the pseudo-variables are generated either with a standard normal distribution or by using permutations on the matrix X (Wu et al., 2007). Another approach consists of adding noise to the response variable and leads to similar results (Luo et al., 2006); its rationale is based on the work of Cook and Stefanski (1994), which introduced the simulation-based algorithm SIMEX. Adding noise to the matrix X has already been used in the context of microarrays (Chen et al., 2007). Simsel (Eklund and Zwanzig, 2012) is an algorithm that both adds noise to variables and uses random pseudo-variables. One new and inspiring approach is stability selection (Meinshausen and Bühlmann, 2010), in which the variable selection method is applied to sub-samples, and informative variables are defined as variables which have a high probability of being selected. Bootstrapping has been applied to the lasso on both the response variable and the matrix X, with better results in the former case (Bach et al., 2008). A random lasso, in which variables are weighted with random weights, has also been introduced (Wang et al., 2011).
In this article, following the idea of using simulation to enhance the variable selection methods, we propose the SelectBoost algorithm. Unlike the other algorithms reviewed above, it takes into account the correlation structure of the data. Furthermore, our algorithm is motivated by the fact that, in the case of non-regularized variable selection methods, if a group contains variables that are highly correlated together, only one of them will be chosen, more or less at random.
The SelectBoost algorithm has been designed in a general framework in order to avoid selecting non-predictive correlated features. The main goal is to improve the positive predictive value (PPV), i.e. the proportion of selected variables which truly belong to the support S.
As we assume that the variables are centered and normalized (‖x_p‖2 = 1 for p = 1, …, P), every x_p lies on the unit sphere S^(N−1). The process of centering can be seen as a projection on the hyperplane H having the unit vector (1/√N)(1, …, 1)' as normal vector. Moreover, the intersection between S^(N−1) and H is a sphere of dimension N−2. We further define the following isomorphism:
Here, we make the assumption that a group of correlated variables consists of independent realizations of the same multivariate Gaussian distribution. As the variables are normalized with respect to the ℓ2 norm, we will use the von Mises–Fisher distribution (Sra, 2012) on the sphere S^(N−2), thanks to the isomorphism above, in order to generate new perturbed design matrices. The probability density function of the von Mises–Fisher distribution for a random P-dimensional unit vector x is given by: f_P(x; μ, κ) = C_P(κ) exp(κ μ'x), where μ is the mean direction (‖μ‖ = 1), κ ≥ 0 the concentration parameter and C_P(κ) a normalizing constant involving a modified Bessel function.
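For readers who want to experiment with this perturbation scheme, the von Mises–Fisher distribution can be sampled with the classical rejection algorithm of Wood (1994). The sketch below is in Python for illustration only (the SelectBoost package itself is an R package, and this is not its implementation); it draws one unit vector around a mean direction mu with concentration kappa:

```python
import math
import random

def sample_vmf(mu, kappa, rng=None):
    """Draw one unit vector from a von Mises-Fisher distribution with mean
    direction mu (a unit vector of length p) and concentration kappa > 0,
    using Wood's (1994) rejection scheme."""
    rng = rng or random.Random(0)
    p = len(mu)
    # Sample the component w of the draw along mu.
    b = (-2 * kappa + math.sqrt(4 * kappa ** 2 + (p - 1) ** 2)) / (p - 1)
    x0 = (1 - b) / (1 + b)
    c = kappa * x0 + (p - 1) * math.log(1 - x0 ** 2)
    while True:
        z = rng.betavariate((p - 1) / 2, (p - 1) / 2)
        w = (1 - (1 + b) * z) / (1 - (1 - b) * z)
        if kappa * w + (p - 1) * math.log(1 - x0 * w) - c >= math.log(rng.random()):
            break
    # Sample a direction v uniformly in the hyperplane orthogonal to mu.
    v = [rng.gauss(0, 1) for _ in range(p)]
    dot = sum(vi * mi for vi, mi in zip(v, mu))
    v = [vi - dot * mi for vi, mi in zip(v, mu)]
    norm = math.sqrt(sum(vi ** 2 for vi in v))
    v = [vi / norm for vi in v]
    # Combine: w along mu, sqrt(1 - w^2) along the orthogonal direction v.
    s = math.sqrt(max(0.0, 1 - w ** 2))
    return [w * mi + s * vi for mi, vi in zip(mu, v)]
```

With a large kappa, draws concentrate around mu; as kappa decreases they spread toward the uniform distribution on the sphere, which is how an increasing noise level can be injected into the correlated columns.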
The multivariate Gaussian distribution assumption is not restrictive. As long as the group of correlated variables consists of independent realizations of the same distribution, the SelectBoost algorithm can be applied: either directly, to assess the stability of the selected variables with perturbed datasets with an increasing noise level, which is the core idea behind the SelectBoost algorithm, or after replacing the von Mises–Fisher distribution with a more relevant one.
To use the SelectBoost algorithm, we need a grouping method g_c0 depending on a user-provided constant c0 ∈ [0, 1]. This constant determines the strength of the grouping effect. The grouping method maps each variable index p to an element of P({1, …, P}) [with P(S) the powerset of the set S, i.e. the set which contains all the subsets of S]. Concretely, g_c0(p) is the set of all variables which are considered to be linked to the variable x_p and X_g_c0(p) is the submatrix of X containing the columns whose indices are in g_c0(p). We impose the following constraints on the grouping function [Equation (6)]: p ∈ g_c0(p) for every c0, and g_1(p) = {p}.
Furthermore, we need to have a selection method:
The SelectBoost algorithm returns the vector ζ = (ζ_1, …, ζ_P). Each of these values has to be compared to a threshold to determine which variables are selected: we choose to select a variable p if ζ_p is at least equal to the threshold. The simulation study showed that the choice of the threshold is critical and that the algorithm can be improved if we enforce that the ζ_p values, as functions of c0, are non-increasing, see Figure 1 bottom. This additional requirement makes sense: the more variables the resampling process involves (with smaller c0), the less often a given variable will be selected.
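The non-increasing requirement on the ζ_p values amounts to taking a running minimum along the decreasing c0 grid. A minimal sketch (Python, illustrative only; the function name is ours):

```python
def enforce_nonincreasing(zeta_path):
    """Post-processing step: given the selection proportions of one variable
    along the decreasing c0 grid, replace each value by the running minimum
    so that the sequence is non-increasing."""
    out, current = [], float("inf")
    for z in zeta_path:
        current = min(current, z)
        out.append(current)
    return out

print(enforce_nonincreasing([1.0, 0.8, 0.9, 0.4, 0.6]))
# [1.0, 0.8, 0.8, 0.4, 0.4]
```

The spurious bump at the third grid point (0.9 after 0.8) is flattened, which is exactly the behavior motivated above.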
We first have to choose the grouping function. One of the simplest ways to define a grouping function is the following:
In other words, the correlation group of the variable p is determined by the variables whose absolute correlation with x_p is at least c0. Alternatively, the structure of correlation may further be taken into account using graph community clustering. Let C be the correlation matrix of the matrix X. Let A be defined as follows:
Then, we apply a community clustering algorithm on the undirected network with weighted adjacency matrix defined by . Using a graph community clustering algorithm is helpful with large datasets while still clustering similar variables together. For instance, the fast greedy modularity optimization algorithm for finding community structure (Clauset et al., 2004) runs in essentially linear time for many real-world networks given that they are sparse and hierarchical.
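The direct grouping rule described above can be sketched in a few lines (Python, illustrative only; the SelectBoost package implements its grouping functions in R, and the names here are ours):

```python
def correlation(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def g_direct(X_cols, p, c0):
    """Direct grouping: indices whose absolute correlation with column p is
    at least c0. Always contains p itself and, absent exactly collinear
    columns, reduces to {p} when c0 = 1, as required by the constraints."""
    return {q for q in range(len(X_cols))
            if q == p or abs(correlation(X_cols[p], X_cols[q])) >= c0}

# Toy design: column 1 is exactly proportional to column 0.
cols = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 1, 3, 2]]
print(sorted(g_direct(cols, 0, 0.99)))  # [0, 1]
```

A community-clustering-based grouping would replace the thresholding by communities of the weighted correlation graph, but the interface (index in, set of indices out) stays the same.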
Once the grouping function is chosen, we have to choose the parameter c0. Due to the constraints in Equation (6), the SelectBoost algorithm reduces to the initial variable selection method when c0 = 1. As we will show in the next section, the smaller the parameter c0, the higher the precision of the resulting selected variables. On the other hand, the probability of choosing none of the variables (i.e. resulting in the choice of an empty set) increases as the parameter c0 decreases. In the perspective of experimental planning, the choice of c0 should result from a compromise between precision and proportion of identified active variables. Hence, the c0 parameter can be used to introduce a confidence index γ_p related to the variable x_p:
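A sketch of how such a confidence index can be computed from the selection path (Python, illustrative; names are ours), assuming γ_p is one minus the lowest c0 of the decreasing grid down to which the variable is still selected:

```python
def confidence_index(c0_grid, selected_path):
    """Confidence index of one variable: one minus the lowest c0 of the
    (decreasing) grid down to which the variable remains selected.
    Returns None if the variable is never selected."""
    gamma = None
    for c0, sel in zip(c0_grid, selected_path):
        if not sel:
            break
        gamma = 1 - c0
    return gamma

# Variable selected at c0 = 1, 0.9 and 0.8, then lost at 0.7:
print(round(confidence_index([1.0, 0.9, 0.8, 0.7], [True, True, True, False]), 2))  # 0.2
```

A variable that survives only at c0 = 1 thus gets confidence 0, and the longer it survives the resampling perturbation, the higher its confidence index.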
We benchmarked the algorithm with a large simulation study with four data generation processes and three real datasets (Table 1). Generated datasets are available upon request and real datasets are available either upon request or online.
For Types 1 and 2, the number of variables is 1000 and the number of observations is 100. The data are generated from a cluster simulation (Bair et al., 2006; Bastien et al., 2015). Only the first 50 predictors are linked to the response Y; the last 950 variables are randomly generated from a standard normal distribution. For Example 3, the response variable is linear but was turned into a binary variable taking values +1 and −1 according to a threshold.
Examples 1 and 3 are linear regression examples whereas Examples 2, 4, 5, 6 and 7 are logistic regression ones, for which we will assume a logistic model with a binary response variable (Peng et al., 2002).
We provide results for 12 different settings based on 10 different models, see Supplementary Section S1 for more details.
The SelectBoost algorithm is based on correlated resampling and hence random. We wanted to assess both the stability and performance of the algorithm. As a consequence, for the four types of simulated data, we focused both on what may be called a repeatability study (a given dataset was analyzed 100 times to estimate the variation only due to the fact that the algorithm is random) and a reproducibility study (100 different datasets were generated and analyzed to estimate the variability due to both data simulation—from the same data generation—and the fact that the algorithm is random).
The repeatability issue was raised, for instance, by Boulesteix (2014) and Magnanensi et al. (2017) for PLS models. For those models, random split cross-validation is known to have poor repeatability. We used two types of grouping functions: either direct grouping (gdirect), determined by the variables whose correlation with x_p is at least c0, or community clustering-based grouping (gcc).
The cost (memory and time) of the random generation step can be limited thanks to a sparse correlated resampling feature. The remaining cost of the algorithm is of order B × Nc0 × Time1, with B the number of resamplings, Nc0 the number of c0 values that are investigated and Time1 the time needed to fit the model once.
To demonstrate the performance of the SelectBoost method, we compared our method with stability selection (Meinshausen and Bühlmann, 2010) and with a naive version of our algorithm, naiveSelectBoost. The naiveSelectBoost algorithm works as follows: estimate β with any variable selection method, then, if the group g_c0(p) as defined in Equation (7) is not reduced to {p}, shrink β̂_p to 0. The naiveSelectBoost algorithm is similar to the SelectBoost algorithm, except that it does not take into account the error made when choosing at random a variable among a set of correlated variables.
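The naiveSelectBoost rule just described can be sketched in a few lines (Python, illustrative only; the actual implementation is an R package):

```python
def naive_selectboost(beta_hat, groups):
    """naiveSelectBoost as described above: keep a selected coefficient only
    if its correlation group is reduced to the variable itself, otherwise
    shrink it to zero. 'groups' maps each index p to its group g_c0(p)."""
    return [b if groups[p] == {p} else 0.0
            for p, b in enumerate(beta_hat)]

# Variable 1 sits in a correlation group {1, 2}, so its coefficient is shrunk.
print(naive_selectboost([1.5, -0.7, 0.0], {0: {0}, 1: {1, 2}, 2: {2}}))
# [1.5, 0.0, 0.0]
```

Unlike the full algorithm, no resampling is performed, so the chance that the selection method picked the wrong member of the group {1, 2} is never measured.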
We use four indicators to evaluate the abilities of our method on simulated data. We define:
Note that our interest is focused on precision, as our goal is to select reliable variables. As stated before, when c0 decreases toward zero, we expect a gain in precision and a decrease in recall. We also compute the F-score, which combines both recall and precision. As an improvement of precision comes with a decrease in the number of identified variables, the best method is the one with the highest precision for a given level of selection.
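Under the definitions used here (precision = PPV, recall, F-score and the proportion of selected variables as the level of selection), the four indicators can be computed as in this illustrative Python sketch:

```python
def selection_metrics(selected, true_support, n_variables):
    """Precision (PPV), recall, F-score and selection level of a selected
    set against the true support, both given as sets of variable indices."""
    tp = len(selected & true_support)
    precision = tp / len(selected) if selected else 0.0
    recall = tp / len(true_support)
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall > 0 else 0.0)
    selection = len(selected) / n_variables
    return precision, recall, f_score, selection

# 3 of the 4 selected variables are true positives, out of 5 true variables
# among 20 candidates:
print(selection_metrics({0, 1, 2, 9}, {0, 1, 2, 3, 4}, 20))
```

The trade-off discussed above is visible here: dropping variable 9 from the selected set would raise precision to 1 while recall stays at 3/5.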
We show the evolution of the four criteria (recall, precision, F-score and selection) with regard to the decrease of c0. When c0 = 1, the SelectBoost algorithm is equivalent to the initial variable selection method. We introduce a post-processing step to enforce that, for a given variable, the proportion of selection is non-increasing. It is the expected behavior since the correlated resampling is not meant to increase the probability of selection for a variable. Such an increase may happen for small c0 values when a variable that is not linked with the response is mixed with a variable that is linked to the response. For all the simulations, this post-processing step increases the PPV of the SelectBoost algorithm, see Figure 1. As our primary focus is PPV, we recommend the use of this post-processing step. More details can be found in the Supplementary Graphs S7–S174.
We created precision-recall plots to display the effects of the algorithm on the performance of all the models and criteria used for a given dataset. Identical model fitting criteria share the same colors. The arrows point toward decreasing c0 values. Direct grouping and community grouping lead to similar results, Figure 2 and Supplementary Figures S1, S3 and S4. These figures also show that the results for a single dataset repeated 100 times are similar to the results for 100 different datasets. The Zoom sequence, a 10-step regular grid from the quantile of order 1 (the maximum value) to the quantile of order 0.9, achieves high PPV, Supplementary Figure S299.
Supplementary Figure S5 displays an example of raw SelectBoost (without the non-increasing post-processing step) for direct grouping and 100 different datasets; it should be compared to Figure 2. This effect is even stronger for smaller values of the threshold. The non-increasing post-processing step greatly improves the results of the algorithm and leads to monotonic relationships between the recall, the precision and the c0 value. The PPV benefits less from smaller values of the threshold, Supplementary Figure S6.
All the results of the simulation study showed the good performance and stability of the algorithm, which we then applied once to each of the real datasets. The results of the gdirect- and gcc-based SelectBoost are similar, the gcc-based one being a bit more time-consuming than the gdirect one. In the remainder of this article, we report results and figures for gdirect-based SelectBoost.
According to our simulation studies, Supplementary Graphs S7–S174, one should choose c0 between and , see Figure 1 top. In our simulation studies, we used an 11-step c0 sequence, but, according to our results, it could be limited to 6 steps [from mean() to ] for the biggest datasets. The value of B should not be lower than 10; B = 50 or B = 100 will provide more stable results. As a consequence, the minimal time cost of the SelectBoost algorithm will be 60 times the time cost of the regular model fit, which can be afforded in almost every case. The parallel processing support of the SelectBoost package can help to reduce this time. Hence, the SelectBoost algorithm seems feasible for most datasets, including omics datasets, as in our simulation study with the three real datasets.
Hence, to assess the performance of the SelectBoost algorithm, we performed comprehensive numerical studies. As stated before, the SelectBoost algorithm can be applied to any existing variable selection method.
Figure 1 top shows the results for the lasso selection with a penalty parameter chosen using information criteria for Type 1 datasets. In this example, we improve the precision up to 1. Moreover, as shown by Figures 1 and 3, the proportion of models for which the precision reaches one increases with the decrease of c0. The F-score either increases, remains stable or shows a small decrease, indicating that the increase of PPV compensates the loss in recall.
In the previous section, we mentioned the possibility of using SelectBoost to obtain a confidence index, corresponding to one minus the lowest c0 for which a variable is selected. For each c0 , we plotted the average number of selected variables as a function of the proportion of correctly identified variables (Fig. 3 and Supplementary Figs S300–S304). As expected, the proportion of correctly identified variables increases with the increase of the confidence index and with the decrease of the average number of identified variables. Therefore, the proportion of non-predictive features decreases with the increase of the confidence index.
The SelectBoost algorithm shows its superiority over the naiveSelectBoost algorithm. The error made when choosing a variable randomly among a set of correlated variables leads to more incorrect choices of variables. While the intensive simulation of our algorithm allows taking this error into account, naiveSelectBoost does not.
Finally, we compared the SelectBoost algorithm with stability selection. Stability selection uses a resampling algorithm to determine which of the variables included in the model are robust. In our simulations, stability selection achieves high precision but also low recall. Moreover, in contrast to the SelectBoost algorithm, stability selection does not allow one to choose a convenient precision-selection trade-off.
The timings of the algorithm can be found in Supplementary Figures S175–S222.
We applied our algorithm to three real datasets. We studied, with respect to the threshold, the number of non-zero variables, the number of variables selected by SelectBoost and their ratio. We found results that were concordant with those of the simulated datasets. Figure 4 displays those results for an SGPLS-based SelectBoost of the Leukemia dataset with a 0.25 threshold. See Supplementary Figures S247–S298.
We report the results for the RNA-Seq dataset providing mRNA expressions from Huntington’s disease and neurologically normal individuals. This dataset was downloaded from the GEO database under accession number GSE64810 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE64810). This dataset contains 20 Huntington’s disease cases and 49 neurologically normal controls and includes 28 087 genes as explanatory variables. An independent filtering (Bourgon et al., 2010) preprocessing step was first performed using data-based filtering for replicated high-throughput transcriptome sequencing experiments (Rau et al., 2013). Then, we applied the lasso selection method to this reduced dataset (see Fig. 5 left for the whole path of the solution). We used cross-validation to choose the appropriate level of penalization [i.e. the λ parameter in Equation (3)].
We then applied our SelectBoost algorithm to the lasso method with penalty parameter chosen by cross-validation. We used a range for the c0 parameter starting from 1 down to 0.7 with steps of 0.05, which corresponds to a confidence index from 0 to 0.3. For each step, the probability of being included in the support was calculated with 200 simulations as described in Algorithm 1. We set the threshold of being in the support to 0.95 to avoid numerical instability. We classified the selected variables into three categories: those that are identified for each confidence index from 0 to 0.15 (red), those identified from 0 to 0.25 (orange) and those identified from 0 to 0.3 (green). The last category contains the most reliable variables selected by the SelectBoost algorithm because these variables are identified from low to high confidence index.
With the lasso selection method, 15 variables were selected. Among them, four genes were identified by SelectBoost into the three different categories of confidence index (see Fig. 5 right): two genes for low confidence (red) (ANXA3 and INTS12), one gene for intermediate confidence (orange) (NUB1) and one gene for high confidence (green) (PUS3).
The interesting point in these three examples is that the identified variables are neither the first variables selected by the lasso nor the variables with the highest coefficients (see Fig. 5 left). This result demonstrates that our algorithm can advantageously select variables with high confidence, and not just variables with the highest coefficients.
Finally, we assessed the differential expression of these genes between patients and controls using the limma package (Linear Models for Microarray and RNA-Seq Data) (Ritchie et al., 2015). The four identified genes are significantly under-expressed in neurologically normal controls, confirming the result of a logistic model built with these four genes.
Sparsity is a well-known feature of most biological networks (Barabási et al., 2003). An actor can only be regulated by a small number of other actors, whereas it may regulate any number of other actors. Hence, variable selection methods, such as the lasso, enforce this sparsity feature and are core components of most biological network reverse-engineering tools. As a consequence, we propose to apply the SelectBoost algorithm in two different ways in order to improve biological network reverse-engineering: as a post-processing step after the inference was made, or during the inference itself in order to select the most stable predictors for each node in the network. When used as a post-processing step, one can assess, for any of the inferred links between the actors of the network, its confidence index against correlated resampling of the predictors. When used during the inference step, one can infer a model that is only built with links with a high enough confidence index. The former is implemented in the SelectBoost package as a new method for the Cascade package (Jung et al., 2014). The latter is implemented in the new Patterns CRAN package as a dedicated fitting function and is especially useful when trying to find targets for biological intervention that are strongly related to markers of some diseases through the reverse-engineered network and its useful and reliable links.
We benchmarked those two uses of the algorithm with a particular type of biological networks that we have been using for several years: cascade networks (Vallat et al., 2013).
For the post-inference processing, we first fit a model to a cascade network using the Cascade package inference function. Then, we compute confidence indices for the inferred links using the SelectBoost algorithm; more details, as well as the code of the simulations, can be found in the vignette of the package, ‘Towards Confidence Estimates in Cascade Networks using the SelectBoost Package’, available at https://fbertran.github.io/SelectBoost/articles/confidence-indices-Cascade-networks.html. An example of those results is shown in Figure 6 with a cascade network for four time points and four groups of 25 actors.
For the use of the SelectBoost algorithm during the fitting step of a cascade network reverse-engineering, we used the Patterns package. Benchmark results were reported as sensitivity, positive predictive value and F-score, shown in Figure 7; the code, the simulation details and the remaining results are part of a vignette of the package, ‘Benchmarking the SelectBoost Package for Network Reverse Engineering’, available at https://fbertran.github.io/SelectBoost/articles/benchmarking-SelectBoost-networks.html.
We created unweighted and weighted versions of the algorithm. The weighted version enables the user to include weights in the model, i.e. to favor or disfavor some links between the actors, for instance in order to take biological knowledge into account.
The results of the simulation study shown in Figure 7 compare our approach to a standard setup for stability selection and the regular lasso, both for an unweighted version of the algorithms and for a highly correctly weighted version of the same algorithms.
By highly correctly weighted, we mean that we included influential weights in the model according to the links that existed in the network used for data simulation. This network was randomized from one simulation to another. This weighted setting was used to determine whether including correct biological knowledge would help the reverse-engineering algorithm to retrieve the correct network. If correct biological knowledge is included in the model, all three fitting functions lead to similar and outstanding results for the F-score criterion, without even requiring a search for an optimal thresholding value as we had to do with the Cascade package.
For each simulated dataset, vertical dots display the optimal threshold level that should be used to maximize the F-score. It is computed with respect to the actual values, which are unknown for real datasets. Without weights, SelectBoost shrinks the range of optimal values when compared to the lasso or stability selection. With correct weights, none of the methods requires a cut-off value to maximize the F-score.
In an unweighted setting, the SelectBoost version of the fitting process shows better performance than stability selection and the lasso as long as the cut-off value is below 0.4, which is about double the optimal thresholding value.
We introduced the SelectBoost algorithm, which relies on intensive computation to select variables with high precision (PPV). The user of SelectBoost can apply this algorithm to produce a confidence index, or choose an appropriate precision-selection trade-off to select variables with high confidence and avoid selecting non-predictive features. The main idea behind our algorithm is to take into account the correlation structure of the data and use intensive computation to select reliable variables.
The choice of the threshold is critical since such a choice leads to two effects.
We will want the first property to retrieve the stable core of the predictors for models that are known to randomly choose between correlated variables, such as the lasso or adaptive lasso. We will want the second property for models that scarcely select variables, such as variable selection models using variational approximation methods for binary responses (varbvs). A non-constant threshold should also be investigated by those who would like to introduce corrections, for instance FDR-like corrections such as Holm–Bonferroni, in the variable selection process.
We demonstrated the performance of our algorithm through simulation studies in various settings. To get the best results, we recommend the use of c0 in the range of mean() to with the non-increasing post-processing step. It could be useful to decrease the lower bound to for the smallest datasets. The user should never use a c0 value too close to the empty-model zone, to avoid a decrease in precision. We succeeded in improving the PPV, whenever possible, of all 12 selection methods with relative stability of recall and F-score. If the PPV was already nearing 1, there is almost no negative effect on the PPV and recall when applying SelectBoost.
Our results open the perspective of a precision-selection trade-off which may be very useful in some situations where many regressions have to be made (e.g. network reverse-engineering with one regression made per node of the network). In such a context, our algorithm may even be used in an experimental design approach.
The application to three real datasets allowed us to show that the most reliable variables are not necessarily those with the highest coefficients. The SelectBoost algorithm is a powerful tool that can be used in every situation where reliable and robust variable selection has to be made.
This work was supported by grants from the Agence Nationale de la Recherche (ANR) [ANR-11-LABX-0070_TRANSPLANTEX]; the INSERM [UMR_S 1109]; the Institut Universitaire de France (IUF) and the MSD-Avenir grant AUTOGEN, all to S.B.; the European regional development fund (European Union) INTERREG V program (project number 3.2 TRIDIAG) to R.C. and S.B.; the Agence Nationale de la Recherche (ANR) [ANR-11-LABX-0055_IRMIA]; the CNRS [UMR 7501] to F.B. and M.M.-B.; and by the French HPC Center ROMEO [UR 201923174L] to F.B.
Conflict of Interest: none declared.