Distributed feature selection - distfeatselect¶

Introduction¶

This algorithm is designed to perform feature selection on a given dataset using distributed scheme logic [1]. The goal is to help users streamline their feature selection process, potentially improving model performance and reducing dimensionality.

Overview¶

The custom feature selection algorithm DFS is implemented as a scikit-learn-compatible transformer, following the scikit-learn API standards. It can be seamlessly integrated into scikit-learn workflows, allowing users to incorporate distributed feature selection into their machine learning pipelines.

Usage¶

In this section, we will demonstrate how to use the dfs package and some of its functionality. This example uses the Breast Cancer Wisconsin (Diagnostic) Data Set (WDBC) dataset available here.

First, we import several essential Python libraries.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

We install and import distfeatselect package.

In [2]:
pip install distfeatselect
Collecting distfeatselect
  Downloading distfeatselect-0.1.0-py3-none-any.whl (18 kB)
Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from distfeatselect) (1.25.2)
Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from distfeatselect) (1.5.3)
Requirement already satisfied: statsmodels in /usr/local/lib/python3.10/dist-packages (from distfeatselect) (0.14.1)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (from distfeatselect) (1.2.2)
Collecting dcor (from distfeatselect)
  Downloading dcor-0.6-py3-none-any.whl (55 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 55.5/55.5 kB 1.1 MB/s eta 0:00:00
Requirement already satisfied: numba>=0.51 in /usr/local/lib/python3.10/dist-packages (from dcor->distfeatselect) (0.58.1)
Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from dcor->distfeatselect) (1.11.4)
Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from dcor->distfeatselect) (1.3.2)
Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas->distfeatselect) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->distfeatselect) (2023.4)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn->distfeatselect) (3.3.0)
Requirement already satisfied: patsy>=0.5.4 in /usr/local/lib/python3.10/dist-packages (from statsmodels->distfeatselect) (0.5.6)
Requirement already satisfied: packaging>=21.3 in /usr/local/lib/python3.10/dist-packages (from statsmodels->distfeatselect) (23.2)
Requirement already satisfied: llvmlite<0.42,>=0.41.0dev0 in /usr/local/lib/python3.10/dist-packages (from numba>=0.51->dcor->distfeatselect) (0.41.1)
Requirement already satisfied: six in /usr/local/lib/python3.10/dist-packages (from patsy>=0.5.4->statsmodels->distfeatselect) (1.16.0)
Installing collected packages: dcor, distfeatselect
Successfully installed dcor-0.6 distfeatselect-0.1.0
In [27]:
from distfeatselect import utils
from distfeatselect import dfs
from distfeatselect import rfs

Then we'll import the data:

In [4]:
pip install ucimlrepo
Collecting ucimlrepo
  Downloading ucimlrepo-0.0.3-py3-none-any.whl (7.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.3
In [5]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
breast_cancer_wisconsin_diagnostic = fetch_ucirepo(id=17)

# data (as pandas dataframes)
X = breast_cancer_wisconsin_diagnostic.data.features
y = breast_cancer_wisconsin_diagnostic.data.targets
In [6]:
y["Diagnosis"].replace({'M': 1, 'B': 0}, inplace=True)
<ipython-input-6-366129d1ada6>:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  y["Diagnosis"].replace({'M': 1, 'B': 0}, inplace=True)

We can preview the dataset:

In [7]:
X.head()
Out[7]:
radius1 texture1 perimeter1 area1 smoothness1 compactness1 concavity1 concave_points1 symmetry1 fractal_dimension1 ... radius3 texture3 perimeter3 area3 smoothness3 compactness3 concavity3 concave_points3 symmetry3 fractal_dimension3
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 ... 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 ... 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 ... 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 ... 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 ... 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678

5 rows × 30 columns

As an additional step, we highly recommend performing correlation analysis and removing desired correlated features. In this example, a correlation heatmap is plotted to find highly correlated features and remove them.

In [8]:
import seaborn as sns # seaborn not included as dfsc dependency
import matplotlib.pyplot as plt #pyplot not included as dfsc dependency

plt.figure(figsize=(20,20))

# Generate a mask to onlyshow the bottom triangle
mask = np.triu(np.ones_like(X.corr(), dtype=bool))

# generate heatmap
sns.heatmap(X.corr(), annot=True, mask=mask, vmin=-1, vmax=1, cmap='magma')
plt.show()

Using information from the heatmap we can optionally remove all the highly correlated features.

In [9]:
#coldrop = ['radius1', 'concave_points1'] identify all the features that are highly correlated and add them to the list of features to be dropped
#X = X.drop(columns=coldrop)

The data is then segmented into training and testing portions to be utilized by DFSC. This standard process is carried out using scikit-learn's train_test_split function.

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

Preprocessing - we can incorporate various preprocessing steps using standard methods and packages such as sklearn.preprocessing. Here, we scaled the data using MinMaxScaler.

In [11]:
minmax = MinMaxScaler(feature_range=(0, 1))
X_train = minmax.fit_transform(X_train)
X_test = minmax.fit_transform(X_test)

We convert the column vectors y_train and y_test into 1D arrays. This step is optional, but if skipped, you will receive warnings

In [12]:
y_train = np.ravel(y_train)
y_test = np.ravel(y_test)

Example of typical usage¶

DFSC follows the sklearn model API. We create an instance of the DFSC class and then call its methods.

We fit the model by instantiating a new DFSC object. All wanted settings are passed into the constructor. Then you call its fit method and pass in the training data and corresponding labels.

More details about settings (the parameters) available for the DFSC constructor are available in the docstrings, for example, via help(DFSC), help(DFSC.fit) ...

In [79]:
dfs_model = dfs.DFS(n_vbins=2, n_hbins=1, n_runs=5, redistribute_features=True)

fitted_model = dfs_model.fit(X_train, y_train)
New best model for hbin 0. roc_auc=0.92595 -- Model features [20, 23, 26]
All horizontal partitions have converged. Final iter count: 3
Optimization terminated successfully    (Exit mode 0)
            Current function value: 0.2794015102275302
            Iterations: 73
            Function evaluations: 73
            Gradient evaluations: 73

Now, we can get a mask, or integer index, of the features selected. We call get_support method. Here, we call it with True as a parameter and the return value will be an array of integers.

In [80]:
selected_features = dfs_model.get_support(True)

If we want to reduce the dataset to the selected features, we call transform method.

In [81]:
X_train_transformed = dfs_model.transform(X_train)
X_test_transformed = dfs_model.transform(X_test)

We can now do a standard process for the task of classification using whatever classifier and metrics we want.

In [82]:
svm_classifier = SVC(kernel='linear')
svm_classifier.fit(X_train_transformed, y_train)

y_pred = svm_classifier.predict(X_test_transformed)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)

print(f"Selected features {selected_features}\n")
print(f"Accuracy {accuracy} \n")
Selected features [20, 23, 26]

Accuracy 0.956140350877193 

Example of using Random Feature Selection (RFSC)¶

RFSC as a base (local) feature selection algorithm inside distributed scheme (DFSC)

RFSC can be used as a feature selection algorithm on its own or as shown below as a base fs algorithm for DFSC.

First, we instantiate a new RFSC object using distance correlation as a method. We also set number of models as 300, number of iterations as 100 and rip threshold for feature inclusion in the final model as 0.9.

More details about settings (the parameters) available for RFSC constructor are available in the docstrings, for example, via help(RFSC).

In [90]:
rfs_model = rfs.RFS(n_models=300, n_iters=100, rip_cutoff=0.9, method = "dcor")

Now, we set rfsc as a local_fs_method in new DFSC object and continue the same as shown in previous example. Instead of rfsc, we could also set local_fs_method to be some other custom or wellknown feature selection algorithm. The only prerequisite for that is that the algorithm follows the sklearn API standards.

In [91]:
dfs_model = dfs.DFS(n_vbins=3, n_hbins= 2, n_runs= 10, max_processes= 6, local_fs_method= rfs_model, redistribute_features= True)

fitted_model = dfs_model.fit(X_train, y_train)
features = dfs_model.get_support(True)
New best model for hbin 1. roc_auc=0.85852 -- Model features [0, 2, 10, 27]
New best model for hbin 0. roc_auc=0.86222 -- Model features [2, 27]
New best model for hbin 0. roc_auc=0.86385 -- Model features [7, 20]
New best model for hbin 1. roc_auc=0.87424 -- Model features [7, 12, 20, 22]
New best model for hbin 0. roc_auc=0.8765 -- Model features [7, 20, 22, 27]
New best model for hbin 1. roc_auc=0.87478 -- Model features [7, 10, 20, 22]
All horizontal partitions have converged. Final iter count: 4
Optimization terminated successfully    (Exit mode 0)
            Current function value: 0.5410368798991855
            Iterations: 68
            Function evaluations: 68
            Gradient evaluations: 68

To reduce dataset to the selected features, we call transform method.

In [92]:
X_train_transformed = dfs_model.transform(X_train)
X_test_transformed = dfs_model.transform(X_test)

We can now do standard process for the task of classification using whatever classifier and metrics we want.

In [93]:
svm_classifier = SVC(kernel='linear')
svm_classifier.fit(X_train_transformed, y_train)

y_pred = svm_classifier.predict(X_test_transformed)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)

print(f"Selected features {selected_features}\n")
print(f"Accuracy {accuracy} \n")
Selected features [20, 23, 26]

Accuracy 0.9649122807017544 

Example of using the algorithm with sklearn.pipeline.Pipeline and sklearn.model_selection.GridSearchCV¶

We can also incorporate either of this two algorithms RFSC or DFSC into sklearn pipeline and do the hyperparameters tuning using GridSearchCV as shown in the following example.

In [73]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

Define the parameter grid for hyperparameter tuning.

In [74]:
param_grid = {
    'custom_feature_selector__n_iters': [1, 5, 10],
    'custom_feature_selector__n_models': [2, 100]
}

Create the pipeline with RFSC as a custom feature selector.

In [75]:
custom_feature_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('custom_feature_selector', rfs.RFS()),
    ('classifier', LogisticRegression())
])

Create the GridSearchCV object.

In [76]:
grid_search = GridSearchCV(custom_feature_pipeline, param_grid, cv=5)

Fit the GridSearchCV object to the data

In [77]:
grid_search.fit(X_train, y_train)
Optimization terminated successfully    (Exit mode 0)
            Current function value: 0.2264559198002898
            Iterations: 27
            Function evaluations: 28
            Gradient evaluations: 27
Optimization terminated successfully    (Exit mode 0)
            Current function value: 0.3198646585000102
            Iterations: 10
            Function evaluations: 10
            Gradient evaluations: 10
Optimization terminated successfully    (Exit mode 0)
            Current function value: 0.15824296432541482
            Iterations: 25
            Function evaluations: 25
            Gradient evaluations: 25
Optimization terminated successfully    (Exit mode 0)
            Current function value: 0.1799385048345915
            Iterations: 26
            Function evaluations: 26
            Gradient evaluations: 26
Optimization terminated successfully    (Exit mode 0)
            Current function value: 0.20840828503781075
            Iterations: 24
            Function evaluations: 24
            Gradient evaluations: 24
Optimization terminated successfully    (Exit mode 0)
            Current function value: 0.04815555354726189
            Iterations: 368
            Function evaluations: 368
            Gradient evaluations: 368
Optimization terminated successfully    (Exit mode 0)
            Current function value: 0.06683737157217241
            Iterations: 314
            Function evaluations: 314
            Gradient evaluations: 314
Optimization terminated successfully    (Exit mode 0)
            Current function value: 0.028636338442135426
            Iterations: 348
            Function evaluations: 348
            Gradient evaluations: 348
Optimization terminated successfully    (Exit mode 0)
            Current function value: 0.05210185917690954
            Iterations: 344
            Function evaluations: 345
            Gradient evaluations: 344
Optimization terminated successfully    (Exit mode 0)
            Current function value: 0.04165635015245378
            Iterations: 348
            Function evaluations: 349
            Gradient evaluations: 348
Tol reached. Number of features above rip_cutoff is 2
Optimization terminated successfully    (Exit mode 0)
            Current function value: 0.41521867893464354
            Iterations: 20
            Function evaluations: 20
            Gradient evaluations: 20
Tol reached. Number of features above rip_cutoff is 2
Optimization terminated successfully    (Exit mode 0)
            Current function value: 0.18587487708913633
            Iterations: 27
            Function evaluations: 27
            Gradient evaluations: 27
Tol reached. Number of features above rip_cutoff is 2
Optimization terminated successfully    (Exit mode 0)
            Current function value: 0.1884076655990143
            Iterations: 32
            Function evaluations: 32
            Gradient evaluations: 32
Tol reached. Number of features above rip_cutoff is 2
Optimization terminated successfully    (Exit mode 0)
            Current function value: 0.3273033980189395
            Iterations: 23
            Function evaluations: 23
            Gradient evaluations: 23
Tol reached. Number of features above rip_cutoff is 2
Optimization terminated successfully    (Exit mode 0)
            Current function value: 0.13532650946767144
            Iterations: 30
            Function evaluations: 30
            Gradient evaluations: 30
Optimization terminated successfully    (Exit mode 0)
            Current function value: 0.058044809022946016
            Iterations: 218
            Function evaluations: 218
            Gradient evaluations: 218
Optimization terminated successfully    (Exit mode 0)
            Current function value: 0.05993659094554035
            Iterations: 310
            Function evaluations: 310
            Gradient evaluations: 310
Optimization terminated successfully    (Exit mode 0)
            Current function value: 0.07254476499238063
            Iterations: 242
            Function evaluations: 242
            Gradient evaluations: 242
Optimization terminated successfully    (Exit mode 0)
            Current function value: 0.06140187846556093
            Iterations: 267
            Function evaluations: 267
            Gradient evaluations: 267
Optimization terminated successfully    (Exit mode 0)
            Current function value: 0.06855497917838979
            Iterations: 194
            Function evaluations: 194
            Gradient evaluations: 194
Tol reached. Number of features above rip_cutoff is 2
Optimization terminated successfully    (Exit mode 0)
            Current function value: 0.1640885577818149
            Iterations: 24
            Function evaluations: 24
            Gradient evaluations: 24
Tol reached. Number of features above rip_cutoff is 2
Optimization terminated successfully    (Exit mode 0)
            Current function value: 0.21761134990520375
            Iterations: 27
            Function evaluations: 28
            Gradient evaluations: 27
Tol reached. Number of features above rip_cutoff is 2
Optimization terminated successfully    (Exit mode 0)
            Current function value: 0.28674168677326023
            Iterations: 23
            Function evaluations: 23
            Gradient evaluations: 23
Tol reached. Number of features above rip_cutoff is 2
Optimization terminated successfully    (Exit mode 0)
            Current function value: 0.1359138712232599
            Iterations: 30
            Function evaluations: 30
            Gradient evaluations: 30
Tol reached. Number of features above rip_cutoff is 2
Optimization terminated successfully    (Exit mode 0)
            Current function value: 0.14981182223707176
            Iterations: 27
            Function evaluations: 27
            Gradient evaluations: 27
Optimization terminated successfully    (Exit mode 0)
            Current function value: 0.12633424060168888
            Iterations: 55
            Function evaluations: 55
            Gradient evaluations: 55
Optimization terminated successfully    (Exit mode 0)
            Current function value: 0.13339962232102096
            Iterations: 73
            Function evaluations: 73
            Gradient evaluations: 73
Optimization terminated successfully    (Exit mode 0)
            Current function value: 0.04200354978783453
            Iterations: 255
            Function evaluations: 255
            Gradient evaluations: 255
Optimization terminated successfully    (Exit mode 0)
            Current function value: 0.10619948022310698
            Iterations: 200
            Function evaluations: 200
            Gradient evaluations: 200
Optimization terminated successfully    (Exit mode 0)
            Current function value: 0.10738086932301721
            Iterations: 83
            Function evaluations: 84
            Gradient evaluations: 83
Optimization terminated successfully    (Exit mode 0)
            Current function value: 0.050314001611967446
            Iterations: 340
            Function evaluations: 340
            Gradient evaluations: 340
Out[77]:
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('custom_feature_selector', RFS()),
                                       ('classifier', LogisticRegression())]),
             param_grid={'custom_feature_selector__n_iters': [1, 5, 10],
                         'custom_feature_selector__n_models': [2, 100]})
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('custom_feature_selector', RFS()),
                                       ('classifier', LogisticRegression())]),
             param_grid={'custom_feature_selector__n_iters': [1, 5, 10],
                         'custom_feature_selector__n_models': [2, 100]})
Pipeline(steps=[('scaler', StandardScaler()),
                ('custom_feature_selector', RFS()),
                ('classifier', LogisticRegression())])
StandardScaler()
RFS(n_models=100, n_iters=300, 
                method = logit, estimator = SVC(kernel='linear'), tuning=10, 
                metric=roc_auc, alpha=0.99), mu_init = None,  tol = 0.002, 
                rip_cutoff = 1
SVC(kernel='linear')
SVC(kernel='linear')
LogisticRegression()

Get the best parameters and best score from the grid search.

In [78]:
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print("Best Parameters:", best_params)
print("Best Score:", best_score)
Best Parameters: {'custom_feature_selector__n_iters': 1, 'custom_feature_selector__n_models': 100}
Best Score: 0.9736263736263737

[1] Brankovic, A., Piroddi, L. (2019). A distributed feature selection scheme with partial information sharing