This algorithm performs feature selection on a given dataset using a distributed scheme [1]. The goal is to help users streamline their feature selection process, potentially improving model performance and reducing dimensionality.
The custom feature selection algorithm DFS is implemented as a scikit-learn-compatible transformer, following the scikit-learn API standards. It can be seamlessly integrated into scikit-learn workflows, allowing users to incorporate distributed feature selection into their machine learning pipelines.
In this section, we demonstrate how to use the distfeatselect package and some of its functionality. The example uses the Breast Cancer Wisconsin (Diagnostic) dataset (WDBC) from the UCI Machine Learning Repository.
First, we import several essential Python libraries.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
We install and import the distfeatselect package.
pip install distfeatselect
Collecting distfeatselect
Downloading distfeatselect-0.1.0-py3-none-any.whl (18 kB)
Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from distfeatselect) (1.25.2)
Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from distfeatselect) (1.5.3)
Requirement already satisfied: statsmodels in /usr/local/lib/python3.10/dist-packages (from distfeatselect) (0.14.1)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (from distfeatselect) (1.2.2)
Collecting dcor (from distfeatselect)
Downloading dcor-0.6-py3-none-any.whl (55 kB)
Requirement already satisfied: numba>=0.51 in /usr/local/lib/python3.10/dist-packages (from dcor->distfeatselect) (0.58.1)
Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from dcor->distfeatselect) (1.11.4)
Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from dcor->distfeatselect) (1.3.2)
Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas->distfeatselect) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->distfeatselect) (2023.4)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn->distfeatselect) (3.3.0)
Requirement already satisfied: patsy>=0.5.4 in /usr/local/lib/python3.10/dist-packages (from statsmodels->distfeatselect) (0.5.6)
Requirement already satisfied: packaging>=21.3 in /usr/local/lib/python3.10/dist-packages (from statsmodels->distfeatselect) (23.2)
Requirement already satisfied: llvmlite<0.42,>=0.41.0dev0 in /usr/local/lib/python3.10/dist-packages (from numba>=0.51->dcor->distfeatselect) (0.41.1)
Requirement already satisfied: six in /usr/local/lib/python3.10/dist-packages (from patsy>=0.5.4->statsmodels->distfeatselect) (1.16.0)
Installing collected packages: dcor, distfeatselect
Successfully installed dcor-0.6 distfeatselect-0.1.0
from distfeatselect import utils
from distfeatselect import dfs
from distfeatselect import rfs
Then we'll import the data:
pip install ucimlrepo
Collecting ucimlrepo
Downloading ucimlrepo-0.0.3-py3-none-any.whl (7.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.3
from ucimlrepo import fetch_ucirepo
# fetch dataset
breast_cancer_wisconsin_diagnostic = fetch_ucirepo(id=17)
# data (as pandas dataframes)
X = breast_cancer_wisconsin_diagnostic.data.features
y = breast_cancer_wisconsin_diagnostic.data.targets
# Work on a copy to avoid pandas' SettingWithCopyWarning, then encode malignant as 1 and benign as 0
y = y.copy()
y["Diagnosis"] = y["Diagnosis"].map({'M': 1, 'B': 0})
We can preview the dataset:
X.head()
| radius1 | texture1 | perimeter1 | area1 | smoothness1 | compactness1 | concavity1 | concave_points1 | symmetry1 | fractal_dimension1 | ... | radius3 | texture3 | perimeter3 | area3 | smoothness3 | compactness3 | concavity3 | concave_points3 | symmetry3 | fractal_dimension3 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 |
| 1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | 0.09744 | ... | 14.91 | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 |
| 4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 |
5 rows × 30 columns
As an additional step, we highly recommend performing a correlation analysis and removing highly correlated features. In this example, a correlation heatmap is plotted to identify such features.
import seaborn as sns  # seaborn is not installed as a distfeatselect dependency
import matplotlib.pyplot as plt  # matplotlib is not installed as a distfeatselect dependency
plt.figure(figsize=(20,20))
# Compute the correlation matrix once and reuse it
corr = X.corr()
# Generate a mask to show only the bottom triangle
mask = np.triu(np.ones_like(corr, dtype=bool))
# Generate the heatmap
sns.heatmap(corr, annot=True, mask=mask, vmin=-1, vmax=1, cmap='magma')
plt.show()
Using information from the heatmap we can optionally remove all the highly correlated features.
# List every highly correlated feature that should be dropped, for example:
# coldrop = ['radius1', 'concave_points1']
# X = X.drop(columns=coldrop)
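One way to build such a drop list programmatically is to scan the upper triangle of the absolute correlation matrix and collect every column whose correlation with an earlier column exceeds a threshold. The sketch below runs on a small synthetic frame; the helper name find_correlated_columns and the 0.95 threshold are illustrative choices and are not part of distfeatselect.

```python
import numpy as np
import pandas as pd


def find_correlated_columns(frame: pd.DataFrame, threshold: float = 0.95) -> list:
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair of columns is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]


# Synthetic data: "b" is nearly a copy of "a", "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})

coldrop = find_correlated_columns(demo, threshold=0.95)
reduced = demo.drop(columns=coldrop)
print(coldrop)
print(list(reduced.columns))
```

Because only the later column of each correlated pair is flagged, one representative of every correlated group is kept.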
The data is then split into training and testing sets for use with DFSC. This standard step is carried out with scikit-learn's train_test_split function.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
Preprocessing: we can incorporate various preprocessing steps using standard methods and packages such as sklearn.preprocessing. Here, we scale the data using MinMaxScaler.
minmax = MinMaxScaler(feature_range=(0, 1))
X_train = minmax.fit_transform(X_train)
X_test = minmax.transform(X_test)  # reuse the scaler fitted on the training data
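As a general scikit-learn convention, a scaler is fitted on the training split and then reused, unchanged, on the test split, so that both splits are mapped with the same minimum and maximum. A minimal sketch on toy arrays (the values are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[2.5], [12.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler.fit_transform(train)  # learns min and max from the training data
test_scaled = scaler.transform(test)        # reuses the training min and max

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Test values that lie outside the training range are mapped outside [0, 1]; this is expected and keeps the two splits on a common scale.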
We convert the column vectors y_train and y_test into 1D arrays. This step is optional, but if it is skipped, you will receive warnings.
y_train = np.ravel(y_train)
y_test = np.ravel(y_test)
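The effect of np.ravel can be seen on a small one-column DataFrame standing in for y_train (the values are illustrative):

```python
import numpy as np
import pandas as pd

targets = pd.DataFrame({"Diagnosis": [1, 0, 1, 0]})  # one-column frame, shaped like y_train
flat = np.ravel(targets)                              # flattened 1D NumPy array

print(targets.shape)
print(flat.shape)
```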
DFSC follows the scikit-learn estimator API. We instantiate a DFS object, passing all desired settings to the constructor, and then call its fit method with the training data and the corresponding labels.
More details about the parameters of the DFS constructor are available in the docstrings, for example via help(dfs.DFS) or help(dfs.DFS.fit).
dfs_model = dfs.DFS(n_vbins=2, n_hbins=1, n_runs=5, redistribute_features=True)
fitted_model = dfs_model.fit(X_train, y_train)
New best model for hbin 0. roc_auc=0.92595 -- Model features [20, 23, 26]
All horizontal partitions have converged. Final iter count: 3
Optimization terminated successfully (Exit mode 0)
Current function value: 0.2794015102275302
Iterations: 73
Function evaluations: 73
Gradient evaluations: 73
Now we can get a boolean mask, or the integer indices, of the selected features by calling the get_support method. Here we pass True, so the return value is an array of integer indices.
selected_features = dfs_model.get_support(True)
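The relation between the boolean mask and the integer indices can be illustrated with any selector that follows scikit-learn's get_support convention. The sketch below uses scikit-learn's SelectKBest on synthetic data as a stand-in (an assumption made so that the snippet runs without distfeatselect) and maps the indices back to hypothetical column names. Since the scaled training data is a NumPy array, the column names have to be kept separately, for instance from X.columns.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X_demo, y_demo = make_classification(
    n_samples=100, n_features=6, n_informative=2, n_redundant=0, random_state=0
)
selector = SelectKBest(f_classif, k=2).fit(X_demo, y_demo)  # stand-in selector

mask = selector.get_support()         # boolean array, one entry per input column
indices = selector.get_support(True)  # integer positions of the selected columns

# Hypothetical feature names; with a DataFrame these would come from X.columns.
feature_names = np.array([f"f{i}" for i in range(X_demo.shape[1])])
selected_names = feature_names[indices]

print(mask)
print(indices)
print(selected_names)
```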
If we want to reduce the dataset to the selected features, we call the transform method.
X_train_transformed = dfs_model.transform(X_train)
X_test_transformed = dfs_model.transform(X_test)
We can now run a standard classification workflow using whatever classifier and metrics we want.
svm_classifier = SVC(kernel='linear')
svm_classifier.fit(X_train_transformed, y_train)
y_pred = svm_classifier.predict(X_test_transformed)
# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
print(f"Selected features {selected_features}\n")
print(f"Accuracy {accuracy} \n")
Selected features [20, 23, 26]
Accuracy 0.956140350877193
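Accuracy is only one possible metric. As a small illustration on made-up labels and predictions (not results from this dataset), scikit-learn's confusion_matrix and f1_score can be computed in the same way:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Toy labels and predictions, chosen only to illustrate the calls.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_hat = np.array([1, 0, 0, 1, 0, 1, 1, 0])

acc = accuracy_score(y_true, y_hat)
cm = confusion_matrix(y_true, y_hat)  # rows are true classes, columns are predicted classes
f1 = f1_score(y_true, y_hat)

print(acc)
print(cm)
print(f1)
```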
RFSC as a base (local) feature selection algorithm inside the distributed scheme (DFSC)
RFSC can be used as a feature selection algorithm on its own or, as shown below, as the base feature selection algorithm for DFSC.
First, we instantiate a new RFS object that uses distance correlation as its method. We also set the number of models to 300, the number of iterations to 100, and the rip_cutoff threshold for including a feature in the final model to 0.9.
More details about the parameters of the RFS constructor are available in the docstrings, for example via help(rfs.RFS).
rfs_model = rfs.RFS(n_models=300, n_iters=100, rip_cutoff=0.9, method="dcor")
Now we pass rfs_model as the local_fs_method of a new DFS object and continue as in the previous example. Instead of RFSC, local_fs_method could also be another custom or well-known feature selection algorithm. The only prerequisite is that the algorithm follows the scikit-learn API standards.
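To make that prerequisite more concrete, here is a minimal sketch of a selector exposing the usual scikit-learn surface: fit, transform, get_support, plus get_params and set_params inherited from BaseEstimator. The class name TopKCorrelationSelector and its scoring rule are hypothetical, and exactly which methods DFS calls on a local selector is not stated here, so treat this as an illustration of the API shape rather than a tested plug-in.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_is_fitted


class TopKCorrelationSelector(TransformerMixin, BaseEstimator):
    """Hypothetical selector: keeps the k features most correlated with y."""

    def __init__(self, k=2):
        self.k = k

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float).ravel()
        Xc = X - X.mean(axis=0)
        yc = y - y.mean()
        denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
        # Absolute Pearson correlation per column; constant columns score 0.
        scores = np.abs(Xc.T @ yc) / np.where(denom == 0, np.inf, denom)
        top = np.argsort(scores)[::-1][: self.k]
        self.support_mask_ = np.zeros(X.shape[1], dtype=bool)
        self.support_mask_[top] = True
        self.n_features_in_ = X.shape[1]
        return self

    def get_support(self, indices=False):
        check_is_fitted(self, "support_mask_")
        return np.flatnonzero(self.support_mask_) if indices else self.support_mask_

    def transform(self, X):
        check_is_fitted(self, "support_mask_")
        return np.asarray(X)[:, self.support_mask_]


# Synthetic check: column 3 is shifted by the label, so it should be selected.
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=100)
X_demo = rng.normal(size=(100, 5))
X_demo[:, 3] += 3.0 * y_demo

selector = TopKCorrelationSelector(k=1).fit(X_demo, y_demo)
print(selector.get_support(True))
print(selector.transform(X_demo).shape)
```

Storing constructor arguments unchanged in __init__ and writing learned state only to attributes ending in an underscore is what lets scikit-learn tools such as clone and GridSearchCV work with the class.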
dfs_model = dfs.DFS(n_vbins=3, n_hbins=2, n_runs=10, max_processes=6, local_fs_method=rfs_model, redistribute_features=True)
fitted_model = dfs_model.fit(X_train, y_train)
features = dfs_model.get_support(True)
New best model for hbin 1. roc_auc=0.85852 -- Model features [0, 2, 10, 27]
New best model for hbin 0. roc_auc=0.86222 -- Model features [2, 27]
New best model for hbin 0. roc_auc=0.86385 -- Model features [7, 20]
New best model for hbin 1. roc_auc=0.87424 -- Model features [7, 12, 20, 22]
New best model for hbin 0. roc_auc=0.8765 -- Model features [7, 20, 22, 27]
New best model for hbin 1. roc_auc=0.87478 -- Model features [7, 10, 20, 22]
All horizontal partitions have converged. Final iter count: 4
Optimization terminated successfully (Exit mode 0)
Current function value: 0.5410368798991855
Iterations: 68
Function evaluations: 68
Gradient evaluations: 68
To reduce the dataset to the selected features, we call the transform method.
X_train_transformed = dfs_model.transform(X_train)
X_test_transformed = dfs_model.transform(X_test)
We can now run a standard classification workflow using whatever classifier and metrics we want.
svm_classifier = SVC(kernel='linear')
svm_classifier.fit(X_train_transformed, y_train)
y_pred = svm_classifier.predict(X_test_transformed)
# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
print(f"Selected features {features}\n")  # indices returned by get_support(True) for this model
print(f"Accuracy {accuracy} \n")
Accuracy 0.9649122807017544
We can also incorporate either of these two algorithms, RFSC or DFSC, into a scikit-learn pipeline and tune the hyperparameters using GridSearchCV, as shown in the following example.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
Define the parameter grid for hyperparameter tuning.
param_grid = {
    'custom_feature_selector__n_iters': [1, 5, 10],
    'custom_feature_selector__n_models': [2, 100]
}
Create the pipeline with RFSC as a custom feature selector.
custom_feature_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('custom_feature_selector', rfs.RFS()),
    ('classifier', LogisticRegression())
])
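The keys of param_grid follow scikit-learn's convention of joining the pipeline step name and the parameter name with a double underscore. The valid names can be listed with get_params. The sketch below uses scikit-learn's SelectKBest as a stand-in for the selector step (an assumption made so that it runs without distfeatselect); with rfs.RFS() in that step the names would carry the same custom_feature_selector__ prefix.

```python
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("custom_feature_selector", SelectKBest(k=5)),  # stand-in for rfs.RFS()
    ("classifier", LogisticRegression()),
])

# Tunable names have the form "<step name>__<parameter name>".
selector_params = sorted(
    name for name in demo_pipeline.get_params()
    if name.startswith("custom_feature_selector__")
)
print(selector_params)
```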
Create the GridSearchCV object.
grid_search = GridSearchCV(custom_feature_pipeline, param_grid, cv=5)
Fit the GridSearchCV object to the training data.
grid_search.fit(X_train, y_train)
Optimization terminated successfully (Exit mode 0)
Current function value: 0.2264559198002898
Iterations: 27
Function evaluations: 28
Gradient evaluations: 27
Optimization terminated successfully (Exit mode 0)
Current function value: 0.3198646585000102
Iterations: 10
Function evaluations: 10
Gradient evaluations: 10
Optimization terminated successfully (Exit mode 0)
Current function value: 0.15824296432541482
Iterations: 25
Function evaluations: 25
Gradient evaluations: 25
Optimization terminated successfully (Exit mode 0)
Current function value: 0.1799385048345915
Iterations: 26
Function evaluations: 26
Gradient evaluations: 26
Optimization terminated successfully (Exit mode 0)
Current function value: 0.20840828503781075
Iterations: 24
Function evaluations: 24
Gradient evaluations: 24
Optimization terminated successfully (Exit mode 0)
Current function value: 0.04815555354726189
Iterations: 368
Function evaluations: 368
Gradient evaluations: 368
Optimization terminated successfully (Exit mode 0)
Current function value: 0.06683737157217241
Iterations: 314
Function evaluations: 314
Gradient evaluations: 314
Optimization terminated successfully (Exit mode 0)
Current function value: 0.028636338442135426
Iterations: 348
Function evaluations: 348
Gradient evaluations: 348
Optimization terminated successfully (Exit mode 0)
Current function value: 0.05210185917690954
Iterations: 344
Function evaluations: 345
Gradient evaluations: 344
Optimization terminated successfully (Exit mode 0)
Current function value: 0.04165635015245378
Iterations: 348
Function evaluations: 349
Gradient evaluations: 348
Tol reached. Number of features above rip_cutoff is 2
Optimization terminated successfully (Exit mode 0)
Current function value: 0.41521867893464354
Iterations: 20
Function evaluations: 20
Gradient evaluations: 20
Tol reached. Number of features above rip_cutoff is 2
Optimization terminated successfully (Exit mode 0)
Current function value: 0.18587487708913633
Iterations: 27
Function evaluations: 27
Gradient evaluations: 27
Tol reached. Number of features above rip_cutoff is 2
Optimization terminated successfully (Exit mode 0)
Current function value: 0.1884076655990143
Iterations: 32
Function evaluations: 32
Gradient evaluations: 32
Tol reached. Number of features above rip_cutoff is 2
Optimization terminated successfully (Exit mode 0)
Current function value: 0.3273033980189395
Iterations: 23
Function evaluations: 23
Gradient evaluations: 23
Tol reached. Number of features above rip_cutoff is 2
Optimization terminated successfully (Exit mode 0)
Current function value: 0.13532650946767144
Iterations: 30
Function evaluations: 30
Gradient evaluations: 30
Optimization terminated successfully (Exit mode 0)
Current function value: 0.058044809022946016
Iterations: 218
Function evaluations: 218
Gradient evaluations: 218
Optimization terminated successfully (Exit mode 0)
Current function value: 0.05993659094554035
Iterations: 310
Function evaluations: 310
Gradient evaluations: 310
Optimization terminated successfully (Exit mode 0)
Current function value: 0.07254476499238063
Iterations: 242
Function evaluations: 242
Gradient evaluations: 242
Optimization terminated successfully (Exit mode 0)
Current function value: 0.06140187846556093
Iterations: 267
Function evaluations: 267
Gradient evaluations: 267
Optimization terminated successfully (Exit mode 0)
Current function value: 0.06855497917838979
Iterations: 194
Function evaluations: 194
Gradient evaluations: 194
Tol reached. Number of features above rip_cutoff is 2
Optimization terminated successfully (Exit mode 0)
Current function value: 0.1640885577818149
Iterations: 24
Function evaluations: 24
Gradient evaluations: 24
Tol reached. Number of features above rip_cutoff is 2
Optimization terminated successfully (Exit mode 0)
Current function value: 0.21761134990520375
Iterations: 27
Function evaluations: 28
Gradient evaluations: 27
Tol reached. Number of features above rip_cutoff is 2
Optimization terminated successfully (Exit mode 0)
Current function value: 0.28674168677326023
Iterations: 23
Function evaluations: 23
Gradient evaluations: 23
Tol reached. Number of features above rip_cutoff is 2
Optimization terminated successfully (Exit mode 0)
Current function value: 0.1359138712232599
Iterations: 30
Function evaluations: 30
Gradient evaluations: 30
Tol reached. Number of features above rip_cutoff is 2
Optimization terminated successfully (Exit mode 0)
Current function value: 0.14981182223707176
Iterations: 27
Function evaluations: 27
Gradient evaluations: 27
Optimization terminated successfully (Exit mode 0)
Current function value: 0.12633424060168888
Iterations: 55
Function evaluations: 55
Gradient evaluations: 55
Optimization terminated successfully (Exit mode 0)
Current function value: 0.13339962232102096
Iterations: 73
Function evaluations: 73
Gradient evaluations: 73
Optimization terminated successfully (Exit mode 0)
Current function value: 0.04200354978783453
Iterations: 255
Function evaluations: 255
Gradient evaluations: 255
Optimization terminated successfully (Exit mode 0)
Current function value: 0.10619948022310698
Iterations: 200
Function evaluations: 200
Gradient evaluations: 200
Optimization terminated successfully (Exit mode 0)
Current function value: 0.10738086932301721
Iterations: 83
Function evaluations: 84
Gradient evaluations: 83
Optimization terminated successfully (Exit mode 0)
Current function value: 0.050314001611967446
Iterations: 340
Function evaluations: 340
Gradient evaluations: 340
GridSearchCV(cv=5,
estimator=Pipeline(steps=[('scaler', StandardScaler()),
('custom_feature_selector', RFS()),
('classifier', LogisticRegression())]),
param_grid={'custom_feature_selector__n_iters': [1, 5, 10],
'custom_feature_selector__n_models': [2, 100]})
Get the best parameters and best score from the grid search.
best_params = grid_search.best_params_
best_score = grid_search.best_score_
print("Best Parameters:", best_params)
print("Best Score:", best_score)
Best Parameters: {'custom_feature_selector__n_iters': 1, 'custom_feature_selector__n_models': 100}
Best Score: 0.9736263736263737
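With the default refit=True, GridSearchCV refits the best pipeline on the full training data, so the winning selector can be inspected and the pipeline scored on held-out data. The sketch below shows the pattern on synthetic data with scikit-learn's SelectKBest standing in for the selector step (an assumption made so that it runs without distfeatselect); all variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("custom_feature_selector", SelectKBest(f_classif)),  # stand-in selector
    ("classifier", LogisticRegression()),
])
search = GridSearchCV(pipe, {"custom_feature_selector__k": [2, 4, 6]}, cv=5)
search.fit(X_tr, y_tr)

best_pipeline = search.best_estimator_  # refitted on all of the training data
best_selector = best_pipeline.named_steps["custom_feature_selector"]
chosen = best_selector.get_support(True)
test_accuracy = search.score(X_te, y_te)  # accuracy of the best pipeline on held-out data

print(chosen)
print(test_accuracy)
```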
[1] Brankovic, A., Piroddi, L. (2019). A distributed feature selection scheme with partial information sharing.