2.3 Feature Selection
Recommended machine learning packages
scikit-learn
Link: (https://scikit-learn.org/stable)
scikit-learn has a unified and simple interface for machine learning algorithms. You can also write your own algorithms under the scikit-learn framework.
skrebate
Link: (https://epistasislab.github.io/scikit-rebate/)
skrebate provides Relief-based feature selection algorithms that conform to the scikit-learn interface. Implemented algorithms: ReliefF, SURF, SURF*, MultiSURF, MultiSURF*, TuRF
mlxtend
Link: (http://rasbt.github.io/mlxtend/)
Additional machine learning algorithms built on the scikit-learn framework. For example:
Feature selectors: ColumnSelector, ExhaustiveFeatureSelector, SequentialFeatureSelector (forward selection, backward selection)
Classifiers: LogisticRegression, Perceptron, MultiLayerPerceptron, EnsembleVoteClassifier
SciKit-Learn Laboratory (SKLL)
A command-line wrapper for scikit-learn. You can quickly test machine learning algorithms without writing Python code.
Link: (https://skll.readthedocs.io/en/latest/index.html)
imbalanced-learn
Techniques for handling datasets with imbalanced class ratios, including over-sampling, under-sampling and ensemble methods.
Link: (https://imbalanced-learn.readthedocs.io/en/stable/)
Other machine learning packages
Machine learning packages similar to scikit-learn: (https://scikit-learn.org/stable/related_projects.html)
Example input files
Expression matrix
Example filename: data/example/matrix.txt
A tab-separated text file: columns are samples and rows are features. Sample IDs are in the first row and feature names are in the first column. The expression matrix is usually generated from the original read count matrix after normalization and batch-effect removal.
Part of the example file:
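A hypothetical excerpt illustrating the layout only; the feature names, sample IDs and values below are invented:

```text
feature      sample_01   sample_02   sample_03
miRNA_a      5.21        0.00        3.14
miRNA_b      1.80        2.43        0.72
```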
Class labels
Example filename: data/example/sample_classes.txt
A tab-separated text file with two columns: sample_id and label. The first line is treated as a header. In this example, the class labels are the disease status of each sample: stage_A, stage_B, stage_C, Normal. All samples except those labeled "Normal" are from HCC patients.
Part of the example file:
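A hypothetical excerpt (sample IDs are invented):

```text
sample_id    label
sample_01    stage_A
sample_02    Normal
```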
Feature selection and cross-validation with a single combination of parameters
The objective of feature selection is to find a small subset of features that robustly distinguishes between normal and cancer samples. exSEEK provides a script, machine_learning.py, for various machine learning tasks including feature selection, cross-validation and prediction.
Command-line usage of machine_learning.py:
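A minimal sketch of an invocation, using only the options documented in this section; the exact sub-command and output options of machine_learning.py may differ:

```bash
# All flags below are documented in this section
machine_learning.py \
    --matrix data/example/matrix.txt \
    --sample-classes data/example/sample_classes.txt \
    --positive-class stage_A,stage_B,stage_C \
    --negative-class Normal \
    --config config.yaml
```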
The basic workflow of machine learning is:
Read input data matrix and class labels
machine_learning.py accepts the data matrix and class labels through the --matrix/-i and --sample-classes options. machine_learning.py assumes that rows are samples and columns are features (different from a count matrix or differential expression input). If the data matrix is oriented differently, the --transpose option swaps rows and columns.
Define positive and negative class
The class labels in sample_classes.txt should contain at least 2 categories. If there are more than 2 categories, we can specify the positive and negative classes using the --positive-class and --negative-class options. Each option can be a single label or a comma-separated list of labels. For example, --positive-class stage_A,stage_B,stage_C --negative-class Normal defines stage_A, stage_B and stage_C as the positive class and Normal as the negative class.
Note that samples not belonging to any of the specified labels are removed before further steps.
Specify subset of features (optional)
To use only a subset of features instead of the whole matrix, we can provide a list of features through the --features features.txt option, where each line of features.txt is a feature name.
The remaining parameters are specified through a configuration file in YAML format, passed with the --config/-c option. The options described above can also be provided in the YAML configuration file.
YAML is a simple text format suitable for exchanging data between programs and is supported by many programming languages. YAML is more flexible and readable than JSON (which does not support comments). For more information about the YAML format, please refer to Wikipedia and the YAML official site.
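As a sketch, the command-line options above might map to configuration keys like the following; the exact key names are assumptions, so consult the exSEEK configuration templates:

```yaml
# Hypothetical key names illustrating a YAML equivalent of the
# command-line options; verify against exSEEK's own templates.
matrix: data/example/matrix.txt
sample_classes: data/example/sample_classes.txt
positive_class: [stage_A, stage_B, stage_C]
negative_class: [Normal]
```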
Feature preprocessing
Before building a machine learning model, the data matrix passes through several preprocessing steps. There are three types of feature preprocessors: transformers, scalers and selectors.
Preprocessors are specified in the preprocess_steps variable in the configuration file. Under preprocess_steps, each preprocessor is specified as a list item (- example_preprocessor). The data matrix is preprocessed in the order in which the preprocessors are listed in the configuration file.
feature transformation
A transformer transforms all elements in the data matrix using the same function. For an expression matrix in RPKM or RPM, we usually transform the matrix by applying the log2 function: x' = log2(x + pseudo_count).
The configuration section for log transformation is:
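A minimal sketch, assuming the preprocess_steps list layout described above; the keys follow the documentation below, and the values are illustrative:

```yaml
preprocess_steps:
  # name/type/enabled/params are the documented keys; base=2 and
  # pseudo_count=1 are illustrative values.
  - log_transform:
      name: LogTransform
      type: transformer
      enabled: true
      params:
        base: 2
        pseudo_count: 1
```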
In this example, log_transform is the name of the preprocessor. Two keys are required under the log_transform section: name and type.
name is the name of the predefined transformer. Currently, only one transformer is available: LogTransform.
type is the type of the preprocessor. It can only take one of three values: transformer, selector or scaler.
enabled is optional. The preprocessor is ignored unless enabled is set to true.
params specifies extra parameters passed to the preprocessor. For LogTransform, two parameters can be specified: base and pseudo_count.
feature scaling
A scaler is usually applied independently to each feature. The most commonly used scaler is the z-score scaler, also called the standard scaler.
The configuration section for standard scaler is:
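A minimal sketch following the same layout as above; the step name scale_features is hypothetical:

```yaml
preprocess_steps:
  # "scale_features" is a hypothetical step name; StandardScaler is
  # one of the scalers listed below.
  - scale_features:
      name: StandardScaler
      type: scaler
      enabled: true
```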
Other available scalers: StandardScaler, RobustScaler, MinMaxScaler, MaxAbsScaler
filter-based feature selector
A selector selects a subset of features, which is the core of feature selection.
There is a global configuration variable n_features_to_select that defines a fixed number of features to select. If the variable is not set and no internal criteria are defined for a selector, all features will be selected.
For a simple filter-based selector, the configuration section is:
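A minimal sketch; the step name diffexp_ttest is hypothetical, and score_type: neglogp matches the ranking described below:

```yaml
preprocess_steps:
  # "diffexp_ttest" is a hypothetical step name.
  - diffexp_ttest:
      name: DiffExp_TTest
      type: selector
      enabled: true
      params:
        score_type: neglogp
```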
The DiffExp_TTest selector performs differential expression on the data matrix using an independent t-test for each feature. Internally, the script invokes differential_expression.R to generate p-values. The score_type parameter transforms the differential expression results of each feature into a score. In this example, the negative log adjusted p-values are used for ranking features. If n_features_to_select is defined, at most this number of features will be selected.
selector based on differential expression
The DiffExpFilter selector executes differential_expression.R to select features. The differential expression script is a standalone script that can perform feature selection with various options. Similar to machine_learning.py, differential_expression.R accepts an expression matrix and a sample classes file as input, with the positive and negative classes specified through the --positive-class and --negative-class options. After finishing differential expression, the script outputs a table with at least three columns: feature_name, log2FoldChange and padj (adjusted p-value).
Currently, the script implements the following DE methods: deseq2, edger_exact, edger_glmqlf, edger_glmlrt, wilcox (Wilcoxon rank-sum test, or Mann–Whitney U test), limma and ttest (unpaired t-test).
For DESeq2, edgeR and wilcox, the input expression matrix is assumed to be a read counts matrix. For edgeR and limma, a normalization method can be specified: RLE, CPM, TMM or upperquartile. DESeq2 uses its own normalization method. For limma and ttest, the input expression matrix is assumed to be normalized and log2-transformed. For wilcox, the input count matrix is normalized and then log2-transformed. To prevent zeros in the log function, a pseudo-count is added to every count, which can be specified by the --pseudo-count option.
For DESeq2, edgeR and limma, a file containing batch information can be provided through the --batch/-b option. In DESeq2 and edgeR, batch effects are removed by using batch variables as covariates in the negative binomial regression. In limma, batch effects are first removed by linear regression using the removeBatchEffect function.
Command-line usage of differential_expression.R:
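A hypothetical invocation; only --positive-class, --negative-class, --batch and --pseudo-count are documented above, so the remaining flag names are assumptions:

```bash
# --matrix, --classes, --method and --output are assumed flag names;
# check differential_expression.R --help for the real interface.
differential_expression.R \
    --matrix data/example/matrix.txt \
    --classes data/example/sample_classes.txt \
    --positive-class stage_A,stage_B,stage_C \
    --negative-class Normal \
    --method deseq2 \
    --output deseq2_results.txt
```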
Example output of differential_expression.R (using DESeq2):
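Only the column layout is sketched here; baseMean, lfcSE and stat are standard DESeq2 result columns, and the exact set emitted by the script may differ:

```text
feature_name    baseMean    log2FoldChange    lfcSE    stat    pvalue    padj
```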
The score_type parameter in DiffExpFilter specifies how to rank features. Three values are possible: neglogp, pi_value and log2fc. pi_value is a combined score of log2fc and neglogp.
wrapper-based feature selector
If the feature selector is wrapper-based, it requires a classifier to perform feature selection. The configuration section of an example wrapper-based feature selector is:
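A minimal sketch assembled from the parameters documented below; the step name max_features and the splitter value are hypothetical choices:

```yaml
preprocess_steps:
  # "max_features" is a hypothetical step name; StratifiedShuffleSplit
  # is one example of a scikit-learn splitter name.
  - max_features:
      name: MaxFeatures
      type: selector
      enabled: true
      params:
        n_features_to_select: 10
        classifier: SGDClassifier
        classifier_params:
          penalty: elasticnet
        splitter: StratifiedShuffleSplit
```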
This selector is called MaxFeatures; it selects features based on the feature importances or coefficients of the internal classifier. The MaxFeatures selector accepts a parameter n_features_to_select that limits the maximum number of features to select. Other wrapper-based feature selectors include: RobustSelector, RFE, RFECV.
A wrapper selector requires a classifier to be specified under the params key. classifier is the name of a predefined classifier. machine_learning.py wraps many classifiers in the scikit-learn Python package under the same names. You can refer to the scikit-learn documentation for the parameters of each classifier.
Optionally, additional parameters can be specified in classifier_params along with classifier. In this example, because the selector uses SGDClassifier to implement ElasticNet feature selection, the penalty parameter in classifier_params is set to elasticnet.
The splitter parameter is the name of the splitter that splits samples into training samples and test samples. machine_learning.py wraps most splitters in the scikit-learn package. You can refer to the scikit-learn documentation for more information.
Train the classifier
After preprocessing, a classifier is trained on the preprocessed matrix. The configuration section is similar to a wrapper-based feature selector:
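A minimal sketch using the variables listed below; RandomForestClassifier and n_estimators are illustrative choices (classifier names follow scikit-learn, as noted above):

```yaml
classifier: RandomForestClassifier
classifier_params:
  n_estimators: 100   # illustrative scikit-learn parameter
sample_weight: balanced
grid_search: false
```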
Note that there are 5 variables for a classifier: classifier, classifier_params, grid_search, grid_search_params and sample_weight. Only classifier is required.
sample_weight: when set to 'balanced', sample weights are computed using the sklearn.utils.class_weight.compute_sample_weight function. Balanced sample weights are computed as n_samples / (n_classes * np.bincount(y)), i.e. inversely proportional to class frequencies. This is a simple way to handle imbalanced classes.
Grid search for hyper-parameter optimization
If you need to optimize hyper-parameters of the internal classifier, you can set grid_search to true and add additional parameters in grid_search_params. param_grid defines the parameter space to search. Each item in param_grid specifies a list of possible values for one hyper-parameter. All combinations of hyper-parameters are evaluated and the combination with the best metric is selected. The hyper-parameters are optimized on the training samples before being passed to the selector.
In the following example, two hyper-parameters, alpha and l1_ratio, need to be optimized:
Grid search sets the hyper-parameters alpha and l1_ratio to [(0.00001, 0.15), (0.001, 0.15), ..., (0.00001, 0.30), ...] in each iteration.
The scoring parameter defines the metric used to optimize hyper-parameters of the classifier. By default, grid search evaluates each combination of hyper-parameters by cross-validation, which can be configured under the cv key. The performance metrics on the test splits are averaged across cross-validation runs for each hyper-parameter combination, and the combination with the highest performance metric is selected.
Performance evaluation by cross-validation
Cross-validation is the outer loop of the pipeline. During each cross-validation run, all samples are split into training samples and test samples using a splitter. Feature preprocessing, classifier training and hyper-parameter optimization are done only on the training samples. The test samples are used to evaluate the combined classifier.
Configuration parameters for cross-validation are given in cv_params:
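A minimal sketch; the splitter follows scikit-learn naming as above, and the remaining keys are illustrative assumptions:

```yaml
cv_params:
  splitter: StratifiedShuffleSplit  # scikit-learn splitter name
  n_splits: 10                      # illustrative
  test_size: 0.2                    # illustrative
  scoring: roc_auc                  # illustrative metric
```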
cv_params is similar to the cv variable in grid_search_params.
An example feature selection run
A complete example of a feature selection command:
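A hypothetical invocation assembled from the options documented in this section; the output option name is an assumption:

```bash
machine_learning.py \
    --matrix data/example/matrix.txt \
    --sample-classes data/example/sample_classes.txt \
    --positive-class stage_A,stage_B,stage_C \
    --negative-class Normal \
    --config config.yaml \
    --output-dir example/output   # hypothetical flag name
```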
The output directory is example/output.
Output files of feature selection
Many files are generated in the output directory after the above command:
features.txt: selected features; plain text with one feature name per line
feature_importances.txt: plain text with two columns: feature name, feature importance
samples.txt: sample IDs in the input matrix selected for feature selection
classes.txt: sample class labels selected for feature selection
final_model.pkl: the final model fitted on all samples, in Python pickle format
metrics.train.txt: evaluation metrics on training data; the first row contains the metric names and the first column is the index of each train-test split
metrics.test.txt: same format as metrics.train.txt, computed on test data
cross_validation.h5: cross-validation details in HDF5 format
Cross validation details (cross_validation.h5)
feature_selection, shape (n_splits, n_features): binary matrix indicating the features selected in each cross-validation split
labels, shape (n_samples,): true class labels
predicted_labels, shape (n_splits, n_samples): predicted class labels on all samples
predictions, shape (n_splits, n_samples): predicted probabilities of the positive class (or decision function values for SVM)
train_index, shape (n_splits, n_samples): binary matrix indicating the training samples in each cross-validation split
Run feature selection with combinations of selectors and classifiers in batch
Assume that we have already run the normalization module. The filenames of the expression matrix files follow this format:
output/$dataset/matrix_processing/${filter}.${normalization}.${batch_removal}.${count_method}.txt
Then we can run the feature selection module using the following command:
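A sketch based on the exseek.py module interface referenced later in this section; the -d option is an assumption:

```bash
exseek.py feature_selection -d ${dataset}   # -d selects the dataset; assumed flag
```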
The feature_selection command tries all combinations of these variables:
n_features_to_select: maximum number of features to select
classifier: classifier to use
selector: feature selection algorithm to use
fold_change_direction: direction for the fold change filter. Possible values: 'any', 'up', 'down'.
You can set the possible values for each variable in config/machine_learning.yaml.
n_features_to_select
Global variable that overrides the number of features for all selectors. If you want to try multiple values, you can set, for example:
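A sketch; whether multiple values are given as a YAML list is an assumption based on the batch-combination behaviour described above:

```yaml
n_features_to_select: [5, 10, 20]   # illustrative values
```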
selectors
Each item in the selectors variable is identical to a selector configuration for machine_learning.py (see the combined sketch after the classifiers item below).
classifiers
Each item in the classifiers variable is identical to a classifier configuration for machine_learning.py, as shown in the sketch below.
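A combined sketch of the selectors and classifiers variables in config/machine_learning.yaml; whether they are mappings of named entries or lists is an assumption, and the entry names and parameter values are illustrative:

```yaml
selectors:
  # Each entry mirrors a machine_learning.py selector configuration;
  # "MaxFeatures_ElasticNet" is a hypothetical entry name.
  MaxFeatures_ElasticNet:
    name: MaxFeatures
    type: selector
    params:
      classifier: SGDClassifier
      classifier_params:
        penalty: elasticnet
classifiers:
  # Each entry mirrors a machine_learning.py classifier configuration.
  RandomForest:
    classifier: RandomForestClassifier
    classifier_params:
      n_estimators: 100
```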
The complete configuration file can be found in (https://github.com/lulab/exSEEK/blob/master/config/machine_learning.yaml).
Output directories of exSEEK feature selection
Feature selection results using one combination of parameters are saved in a separate directory:
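A plausible path pattern assembled from the variables listed below; the actual layout of the exSEEK output tree may differ:

```text
${output_dir}/feature_selection/${preprocess_method}.${count_method}/${compare_group}/${classifier}.${n_select}.${selector}.${fold_change_filter_direction}
```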
Variables in file patterns
output_dir: output directory for the dataset, e.g. output/dataset
preprocess_method: combination of matrix processing methods
count_method: type of feature counts, e.g. domains_combined, domains_long, transcript, featurecounts
compare_group: name of the negative-positive class pair defined in compare_groups.yaml
classifier: classifier defined in the configuration file
n_select: maximum number of features to select
selector: feature selection method, e.g. robust, rfe
fold_change_filter_direction: direction of fold change for filtering features; three possible values: up, down and any
Summary table of exSEEK feature selection
After finishing exseek.py feature_selection, a summary directory is created: output/$dataset/summary/cross_validation
Two files are generated: metrics.train.txt and metrics.test.txt. metrics.test.txt contains performance metrics computed on test data.
Example output file contents of metrics.test.txt:
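Only the expected columns are sketched here, inferred from the filters used in the summary code below; the actual file may contain more columns and the column order is an assumption:

```text
classifier    selector    compare_group    normalization    batch_removal    count_method    n_features    split    roc_auc    ...
```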
Summarize the metrics as a matrix using Python (Jupyter is recommended):
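A minimal pandas sketch matching the filtering and averaging described below; the metrics file path and column names are assumptions based on this section:

```python
import pandas as pd

# Read the summary table (assumed tab-separated);
# replace 'dataset' with your dataset name
metrics = pd.read_csv(
    'output/dataset/summary/cross_validation/metrics.test.txt', sep='\t')

# Keep one combination of parameters (see the filters explained below)
subset = metrics.query(
    '(compare_group == "Normal-HCC")'
    ' & (normalization == "Norm_RLE")'
    ' & (batch_removal == "Batch_limma_1")'
    ' & (n_features == 10)'
    ' & (count_method == "mirna_and_domains_rna")'
)

# Average AUROC across cross-validation runs:
# classifiers as rows, selectors as columns
summary = subset.pivot_table(index='classifier', columns='selector',
                             values='roc_auc', aggfunc='mean')
print(summary)
```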
The above code first selects a subset of the matrix with the following filters:
compare_group == "Normal-HCC": only compare Normal with HCC
normalization == "Norm_RLE": use RLE as the normalization method
batch_removal == "Batch_limma_1": use limma as the batch-effect removal method (remove batch 1)
n_features == 10: select at most 10 features
count_method == "mirna_and_domains_rna": use miRNA + ncRNA domains as features
Then the AUROC (roc_auc) values across cross-validation runs are averaged to create a matrix with classifiers as rows and selectors as columns:
DecisionTree  0.860606  0.904848  0.856061  0.886364  0.862727  0.931818  0.904848  0.880303  0.927879
LogRegL2      0.800000  0.660606  0.794545  0.753333  0.741212  0.627273  0.744242  0.727879  0.693939
MLP           0.847273  0.756970  0.790303  0.790909  0.804848  0.756970  0.701212  0.752121  0.766061
RBFSVM        0.843030  0.903030  0.947273  0.928485  0.952121  0.961818  0.965455  0.916364  0.956970
RandomForest  0.988485  0.978788  0.981212  0.980606  0.979394  0.984545  0.981212  0.985455  0.986061
Similarly, you can summarize AUROC using only miRNA features:
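Reusing the sketch above with a different count_method filter; the label "mirna_only" is an assumption, so check the actual values in your metrics table:

```python
# "mirna_only" is a hypothetical count_method label
subset = metrics.query(
    '(compare_group == "Normal-HCC")'
    ' & (normalization == "Norm_RLE")'
    ' & (batch_removal == "Batch_limma_1")'
    ' & (n_features == 10)'
    ' & (count_method == "mirna_only")'
)
print(subset.pivot_table(index='classifier', columns='selector',
                         values='roc_auc', aggfunc='mean'))
```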
The output table:
DecisionTree  0.735152  0.929394  0.864848  0.933939  0.897879  0.904545  0.900303  0.892121  0.928485
LogRegL2      0.573636  0.743030  0.828485  0.818182  0.716364  0.807879  0.775152  0.828485  0.807879
MLP           0.765455  0.813333  0.823030  0.825455  0.903030  0.881818  0.946667  0.828485  0.859394
RBFSVM        0.816970  0.938788  0.964242  0.979394  0.981818  0.976364  0.983030  0.979394  0.980606
RandomForest  0.954545  0.983636  0.990303  0.983030  0.985758  0.987273  0.992121  0.988788  0.989091