Link to Jupyter notebook: 2.feature-selection.ipynb

In machine learning, feature selection is need for several reasons:

(Interpretation) We want to know which features are most relevant.

(Computation) With fewer features, computation is more efficient in terms of time and memory.

(Performance) After filtering irrelevant features, the model is less prone to overfitting.

In general, feature selection methods fall into three categories:

**Filter methods**

Feature selection is performed independently of the predictor. Usually filter methods are the fastest and simplest way do feature selection. Every feature or a group of features is evaluated in each step, and features that pass a given criteron are kept.

**Wrapper methods**

Different from filter methods in that a complex predictor is used to rank the features, which combines all features to make a prediction. Ideally, all subsets can be evaluated to get the best subset. However, greedy algorithms (e.g. sequential feature selection) and stochastic algorihtms are often used instead of best subset selection because the number of subsets increase exponentially with the number of features. Wrapper methods may be stuck in local minima if they cannot explore the whole feature subset space. Wrapper methods often require large computation resources especially the computational complexity of the predictor is high.

**Embedded methods**

Feature selection is embedded into a prediction algorithm, which is typically implemented by adding a regularization term, either explicitly or implicitly. Feature interdependence is considered by embedded methods. Compared to wrapper methods, the computational complexity embedded methods is lower.

%pylab inlineimport pandas as pd# For machine learningfrom sklearn.datasets import make_classification, make_regressionfrom sklearn.preprocessing import StandardScalerfrom sklearn.linear_model import LogisticRegressionfrom sklearn.ensemble import RandomForestClassifierfrom sklearn.neighbors import KNeighborsClassifierfrom sklearn.tree import DecisionTreeClassifierfrom sklearn.cluster import KMeansfrom sklearn.svm import SVC, LinearSVCfrom sklearn.gaussian_process import GaussianProcessClassifierfrom sklearn.decomposition import PCAfrom sklearn.model_selection import KFold, train_test_split, GridSearchCV, cross_val_scorefrom sklearn.metrics import accuracy_score, roc_auc_score, f1_score, recall_score, precision_score, \roc_curve, precision_recall_curve, average_precision_score, matthews_corrcoef, confusion_matrix# For plottingimport seaborn as snssns.set()sns.set_style('whitegrid')

Populating the interactive namespace from numpy and matplotlib

To make the randomly generated dataset reproducible, we fixed seed of the numpy pseudo-random number generator.

random_state = np.random.RandomState(654321)

n_samples = 1000n_features = 50X, y = make_classification(n_samples=n_samples, n_classes=2, n_features=n_features,n_informative=3, n_redundant=2, n_clusters_per_class=2,class_sep=0.8, random_state=random_state)

Univariate feature selection removes features by calculating a statistical score for individual features only select features with high-ranking scores. Scores for each feature is calculated independently, make this method very efficient and can be performed in parallel. Univariate feature selection is useful when the number of features is very large, compared to the number samples.

In univariate feature selection, scores are typically evaluated between input features and class labels/target variables. For classification problems, continous features can be scored using various statistical test, including Mann-Whitney U tests, t-tests, ANOVA models, area under the ROC curve (AUROC). ANOVA models can accomodate multi-class labels. For discrete features, chi-square test, mutual information or gini index are useful criteria.

For regression problems, correlation coefficient or F-tests can be used for feature selection.

Unsupervised filters do not use information about target variables, but can greatly reduce the number of features and is less prone to overfitting. Commonly used filters include non-zero fraction and variance. Sparse datasets contain large numbers of missing/zero values and features with too many missing values can be removed. Features with larger variance are probably more informative about target values.

After calculating scores for each feature, we can simply select features according to k highest scores or a fixed threshold. If p-values of a statistical test are available, we can select p-values according to a given false positive rate (FPR), false discovery rate (FDR) or Family-wise error rate (FWE).

Features according to false positive rate (FPR) are selected as: $p(x_k) \le \alpha$

Features according to family-wise error rate (FWER) (also called Bonferroni correction) are selected as: $p(x_k) \le \frac{1}{M} \alpha$

Features according to false discovery rate (FDR) are selected by first sorting pvalues in accending order and then select significant features according the following criterion (Benjamini-hochberg procedure): $p(x_k) \le \frac{k}{M} \alpha$

Because univariate feature selection considers each feature independently, redundant features may be selected.

from sklearn.feature_selection import GenericUnivariateSelect, f_classifselector = GenericUnivariateSelect(f_classif, mode='k_best', param=10)selector.fit(X, y)

GenericUnivariateSelect(mode='k_best', param=10,score_func=<function f_classif at 0x7efce2631b70>)

plot_data = pd.DataFrame({'-log10(p-value)': -np.log10(selector.pvalues_),'Feature': np.arange(n_features),'Selected': selector.get_support()})fig, ax = plt.subplots(figsize=(14, 7))sns.barplot(x='-log10(p-value)', y='Feature', hue='Selected', orient='h',data=plot_data, ax=ax)ax.set_yticks([])

[]

from sklearn.feature_selection import f_regressionn_samples = 1000n_features = 50X_reg, y_reg = make_regression(n_samples=n_samples, n_features=n_features,n_informative=3, noise=1, random_state=random_state)selector = GenericUnivariateSelect(f_regression, mode='k_best', param=10)selector.fit(X_reg, y_reg)plot_data = pd.DataFrame({'-log10(p-value)': -np.log10(selector.pvalues_),'Feature': np.arange(n_features),'Selected': selector.get_support()})fig, ax = plt.subplots(figsize=(14, 7))sns.barplot(x='-log10(p-value)', y='Feature', hue='Selected', orient='h',data=plot_data, ax=ax)ax.set_yticks([])

[]

It is a common mistake to do feature selection on the whole dataset prior to external validation such as cross-validation. It should be realized that many feature selection process already uses information of class labels/target variables and can overfit. Feature selection should be included as part of model training process and only use training data.

A typical workflow is:

scores = [0 for i in range(resample_rounds)]for i in range(resample_rounds):X_train, y_train, X_test, y_test = resample(X, y)features = select_features(X_train, y_train)predictor = Predictor()predictor.fit(X_train[:, features], y_train)y_pred = predictor.predict(X_test[:, features])scores[i] = evaluate(y_test, y_pred)mean_score = mean(scores)

For unsupervised feature selection methods, feature selection can be performed outside the validation loop.

from sklearn.feature_selection import RFECVmodel = LinearSVC()cv = RFECV(model, cv=3, step=0.1)cv.fit(X, y)fig, ax = plt.subplots(figsize=(14, 6))feature_importances = np.zeros(n_features)feature_importances[cv.support_] = np.abs(np.ravel(cv.estimator_.coef_))sns.barplot(x=feature_importances, y=np.arange(n_features), color='gray', orient='h',ax=ax)ax.set_yticks([])ax.set_xlabel('Feature importance')ax.set_ylabel('Feature')

Text(0,0.5,'Feature')

Sequential feature selection (SFS) is a greedy algorithm for best subset feature selection. SFS is wrapper method that ranks features according to a prediction model. SFS starts with an empty feature set. During each step, SFS tries to add a feature from remaining features to the current feature set and train the predictor on the new feature set. A given metric (e.g. accuracy) is used to evaluate the prediction performance on the feature set. Optionally, cross-validation can be used to prevent overfitting. After enumerating all remaining features, SFS keeps a feature with the highest prediction performance and add the feature to the current feature set. The feature stops when a desired number of features are selected.

Because SFS retrain the model for every remaining feature and for every step, it demands huge computational resources when the desired number of features is large. For this reason, SFS is recommended only when we want to select a small number of features.

Due to the greedy nature of the SFS algorithm, the selected feature subset is not optimal. To overcome local minima, stochastic algorithms such as genetic algorithm (GA) or other types of randomness can be included in the algorithm.

from mlxtend.feature_selection import SequentialFeatureSelectormodel = LogisticRegression()sfs = SequentialFeatureSelector(model, k_features=5,forward=True, floating=False, scoring='accuracy', cv=5)sfs.fit(X, y)added_features = []for i in range(1, len(sfs.subsets_) + 1):if i == 1:added_features.append(sfs.subsets_[i]['feature_idx'][0])else:added_features.append(list(set(sfs.subsets_[i]['feature_idx']) - set(sfs.subsets_[i - 1]['feature_idx']))[0])added_featuresplot_data = pd.DataFrame(np.column_stack([sfs.subsets_[i]['cv_scores'] for i in sfs.subsets_]))plot_data = pd.melt(plot_data, var_name='n_features', value_name='accuracy')fig, ax = plt.subplots(figsize=(10, 7))sns.stripplot('n_features', 'accuracy', color=(0.6, 0.6, 1.0), data=plot_data, ax=ax)sns.pointplot('n_features', 'accuracy', color=(0.3, 0.3, 1.0), data=plot_data,join=False, ci=None, ax=ax, markers='d')ax.set_xticklabels(['{} ({})'.format(i + 1, feature) for i, feature in enumerate(added_features)])ax.set_xlabel('n_features (added feature)')ax.set_ylim(0.5, 1)

(0.5, 1)

LASSO is a linear model with L1 regularization on the coefficients. $\text{Loss} = \sum_{i=1}^N error(\mathbf{w}^{T} \mathbf{x}_i + b, y_i) + \alpha ||\mathbf{w}||_1$ where $||\mathbf{w}||_1 = \sum_{k=1}^M |w_k|$

The error function depends on the prediction problem. For classification, a Logistic/sigmoid function transforms the linear combination to a class probability between 0 and 1. Then a negative log-likelihood/cross-entropy between predicted probability and true class labels is used as the error function. For regression, squared error is used as the error function.

LASSO generate sparse coefficients (with many zero coefficients) through the L1 regularization term. The loss function of LASSO can be rewritten as a constrained optimization problem. Because the feasible region contains many corners anchored at the axes (zero values on some dimensions), there is high probability that the contour of an objective function will hit one of the corners. The regularization strength $\alpha$ controls the sparseness of the solution. Typically, grid search is used to optimize hyper-parameter $\alpha$ according to cross-validation scores. $\alpha$ is commonly choosen on a exponential scale (e.g. from $10^{-6}$ to $10^{6}$).

from sklearn.linear_model import LogisticRegressionCVmodel = LogisticRegressionCV(Cs=10**np.arange(-6, 7, dtype='float'),penalty='l1', dual=False, solver='liblinear')model.fit(X, y)print('Optimized C = {}'.format(model.C_[0]))fig, ax = plt.subplots(figsize=(14, 6))sns.barplot(x=np.abs(np.ravel(model.coef_)), y=np.arange(n_features), color='gray', orient='h',ax=ax)ax.set_yticks([])ax.set_xlabel('Feature importance')ax.set_ylabel('Feature')

Optimized C = 0.1Text(0,0.5,'Feature')

Elastic net is also a linear model, but combines L1 and L2 regularization. L2 regularization penalize more on coefficient values to prevent overfitting and colinearity (linear correlation between features). L1 regularization focus more on sparseness, but is susceptible to colinearity and may also generate large weights.

$$ \text{Loss} = \sum_{i=1}^N \text{error}(\mathbf{w}^{T} \mathbf{x}_i + b, y_i)

\alpha [ \theta \cdot ||\mathbf{w}||_1

(1 - \theta) \cdot ||\mathbf{w}||_2^2 ] $$

where

$||\mathbf{w}||_2^2 = \sum_{k=1}^M w_k^2$

Elastic net contains two hyper-paremeters: $\alpha$ and $\theta$. The additional parameter $\theta$ controls the ratio between L1 and L2 regularization. When $\theta = 1$, it becomes LASSO. When $\theta = 0$, it becomes RIDGE (with only L2 regularization).

from sklearn.linear_model import ElasticNetCVmodel = ElasticNetCV(l1_ratio=0.5)model.fit(X, y)fig, ax = plt.subplots(figsize=(14, 6))sns.barplot(x=np.abs(np.ravel(model.coef_)), y=np.arange(n_features), color='gray', orient='h',ax=ax)ax.set_yticks([])ax.set_xlabel('Feature importance')ax.set_ylabel('Feature')

Text(0,0.5,'Feature')

Decision tree is a rule-based predictor. Each rule is a criterion on one feature. Each rule/node separate the samples into two subsets with increases homogeneity/purity. Each node grows until all samples are perfectly separated (all leave nodes contains samples from a single class) or a given maximum depth is reached.

If a subset of features is sufficient to separate all samples, these features formed the selected feature set. Alternatively, we can select features according to feature importance. After a decision tree has been built, feature importances are calculated as the sum of weighted number of rules used in the tree.

from sklearn.tree import DecisionTreeClassifiermodel = DecisionTreeClassifier()cv = GridSearchCV(model, param_grid={'max_depth': list(range(1, 10))}, cv=3)cv.fit(X, y)model = cv.best_estimator_fig, ax = plt.subplots(figsize=(14, 6))sns.barplot(x=np.abs(np.ravel(model.feature_importances_)), y=np.arange(n_features), color='gray', orient='h',ax=ax)ax.set_yticks([])ax.set_xlabel('Feature importance')ax.set_ylabel('Feature')

Text(0,0.5,'Feature')

We can use the sklearn.tree.export_graphviz to visualize the rules in the tree.

import graphvizfrom sklearn.tree import export_graphvizfrom wand.image import Image as WImagegraph_data = export_graphviz(model, out_file=None, filled=True, rounded=True)graph = graphviz.Source(graph_data)graph.render('decision_tree')WImage(filename='decision_tree.pdf', resolution=72)

Random forest is an ensemble algorithm that combines predictions of many decision trees. Randomness is incorporated into each decision tree by bootstraping training samples or tree building process. A single decision tree is easy to overfit because it tries to separate samples perfectly but realistic data usually contains noise features. Due to randomness in training individual decision trees, each tree may give different predictions for a sample. These predictions/votes are combined through majority rule. The fraction of decision trees that contribute to the prediction defines the confidence/probability of the prediction. The ensemble strategy inherits the advantage of decision tree in non-learn modeling of data and is robust against overfitting, which makes random forest an off-the-shelf algorithm for a general machine learning problem.

This example uses grid search with cross-validation to optimize two hyper-parameters: max tree depth and number of trees. Finally, it computes feature importances by averageing feature importances of individual trees. The feature importances are not as sparse as decision trees, but a few features stand out among others.

from sklearn.ensemble import RandomForestClassifiermodel = RandomForestClassifier()cv = GridSearchCV(model, param_grid={'max_depth': list(range(1, 5)),'n_estimators': list(range(10, 60, 10))},cv=3)cv.fit(X, y)model = cv.best_estimator_print('Optimized hyper-parameters: max_depth = {}, n_estimators = {}'.format(cv.best_estimator_.max_depth, cv.best_estimator_.max_depth))fig, ax = plt.subplots(figsize=(14, 6))sns.barplot(x=np.abs(np.ravel(model.feature_importances_)), y=np.arange(n_features), color='gray', orient='h',ax=ax)ax.set_yticks([])ax.set_xlabel('Feature importance')ax.set_ylabel('Feature')

Optimized hyper-parameters: max_depth = 4, n_estimators = 4Text(0,0.5,'Feature')

Understand and run code in this tutorial on a Jupyter notebook.

Try univariate feature selection using FDR, FWER, FPR and compare features selected by each method.

Implement recursive feature elimination (RFE) by yourself. Eliminate 10% features in each round. Plot cross-validation scores (ROAUC) vs number of features.

Try varying the regularization parameter ($\alpha$ and $\theta$) in LASSO and elastic net and see how they affect the selected features and coefficients.

Chandrashekar, G., and Sahin, F. (2014). A survey on feature selection methods. Computers & Electrical Engineering 40, 16–28.

Guyon, I., and Elisseeff, A. (2003). An Introduction to Variable and Feature Selection. J. Mach. Learn. Res. 3, 1157–1182.

Saeys, Y., Inza, I., and Larrañaga, P. (2007). A review of feature selection techniques in bioinformatics. Bioinformatics 23, 2507–2517.