
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/feature_selection/plot_rfe_with_cross_validation.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_auto_examples_feature_selection_plot_rfe_with_cross_validation.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_feature_selection_plot_rfe_with_cross_validation.py:


===================================================
Recursive feature elimination with cross-validation
===================================================

An example of Recursive Feature Elimination (RFE) in which the number of
selected features is tuned automatically by cross-validation.

.. GENERATED FROM PYTHON SOURCE LINES 12-19

Data generation
---------------

We build a classification task using 3 informative features. The introduction
of 2 additional redundant (i.e. correlated) features has the effect that the
selected features vary depending on the cross-validation fold. The remaining
features are non-informative as they are drawn at random.

.. GENERATED FROM PYTHON SOURCE LINES 19-34

.. code-block:: default


    from sklearn.datasets import make_classification

    X, y = make_classification(
        n_samples=500,
        n_features=15,
        n_informative=3,
        n_redundant=2,
        n_repeated=0,
        n_classes=8,
        n_clusters_per_class=1,
        class_sep=0.8,
        random_state=0,
    )
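
To make the notion of "redundant" concrete, the sketch below regenerates the
data with ``shuffle=False``, which keeps the columns in a known order
(informative, then redundant, then noise), and checks that the redundant
columns are linear combinations of the informative ones. The names
``X_ordered``, ``y_ordered`` and ``max_residual`` are illustrative and not part
of the original example. Because shuffling is disabled, this is not the same
array as ``X`` above.

.. code-block:: python

    import numpy as np
    from sklearn.datasets import make_classification

    # Same settings as above, except shuffle=False so the column order is known:
    # 3 informative columns, then 2 redundant columns, then 10 noise columns.
    X_ordered, y_ordered = make_classification(
        n_samples=500,
        n_features=15,
        n_informative=3,
        n_redundant=2,
        n_repeated=0,
        n_classes=8,
        n_clusters_per_class=1,
        class_sep=0.8,
        shuffle=False,
        random_state=0,
    )

    informative = X_ordered[:, :3]
    redundant = X_ordered[:, 3:5]

    # Least-squares fit of the redundant columns on the informative ones.
    # An intercept column is included so the check does not rely on the
    # default shift of the generator.
    design = np.column_stack([informative, np.ones(len(informative))])
    coef, _, _, _ = np.linalg.lstsq(design, redundant, rcond=None)
    max_residual = np.abs(design @ coef - redundant).max()
    print(f"Largest residual of the linear fit: {max_residual:.2e}")

A residual close to machine precision indicates that the redundant columns
carry no information beyond what the informative columns already provide.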








.. GENERATED FROM PYTHON SOURCE LINES 35-40

Model training and selection
----------------------------

We create the RFECV object and compute the cross-validated scores. With the
scoring strategy "accuracy", the number of features is chosen to maximize the
proportion of correctly classified samples.

.. GENERATED FROM PYTHON SOURCE LINES 40-61

.. code-block:: default


    from sklearn.feature_selection import RFECV
    from sklearn.model_selection import StratifiedKFold
    from sklearn.linear_model import LogisticRegression

    min_features_to_select = 1  # Minimum number of features to consider
    clf = LogisticRegression()
    cv = StratifiedKFold(5)

    rfecv = RFECV(
        estimator=clf,
        step=1,
        cv=cv,
        scoring="accuracy",
        min_features_to_select=min_features_to_select,
        n_jobs=2,
    )
    rfecv.fit(X, y)

    print(f"Optimal number of features: {rfecv.n_features_}")





.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Optimal number of features: 3




.. GENERATED FROM PYTHON SOURCE LINES 62-67

In the present case, the model with 3 features (which corresponds to the true
generative model) is found to be optimal.
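
Beyond the number of features, one may want to see which columns were kept.
The following is a minimal, self-contained sketch that refits the same model
and reads the ``support_`` and ``ranking_`` attributes of the fitted RFECV
object. The names ``selected`` and ``X_reduced`` are illustrative. Note that
``make_classification`` shuffles the columns by default, so the indices alone
do not reveal which columns are the informative ones.

.. code-block:: python

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import RFECV
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold

    X, y = make_classification(
        n_samples=500,
        n_features=15,
        n_informative=3,
        n_redundant=2,
        n_repeated=0,
        n_classes=8,
        n_clusters_per_class=1,
        class_sep=0.8,
        random_state=0,
    )

    rfecv = RFECV(
        estimator=LogisticRegression(),
        step=1,
        cv=StratifiedKFold(5),
        scoring="accuracy",
        min_features_to_select=1,
    )
    rfecv.fit(X, y)

    # support_ is a boolean mask over the columns of X; ranking_ assigns
    # rank 1 to every selected column and larger ranks to eliminated ones.
    selected = np.flatnonzero(rfecv.support_)
    print(f"Selected feature indices: {selected}")
    print(f"Feature ranking (1 = selected): {rfecv.ranking_}")

    # transform keeps only the selected columns.
    X_reduced = rfecv.transform(X)
    print(f"Reduced data shape: {X_reduced.shape}")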

Plot number of features vs. cross-validation scores
---------------------------------------------------

.. GENERATED FROM PYTHON SOURCE LINES 67-82

.. code-block:: default


    import matplotlib.pyplot as plt

    n_scores = len(rfecv.cv_results_["mean_test_score"])
    plt.figure()
    plt.xlabel("Number of features selected")
    plt.ylabel("Mean test accuracy")
    plt.errorbar(
        range(min_features_to_select, n_scores + min_features_to_select),
        rfecv.cv_results_["mean_test_score"],
        yerr=rfecv.cv_results_["std_test_score"],
    )
    plt.title("Recursive Feature Elimination \nwith correlated features")
    plt.show()




.. image-sg:: /auto_examples/feature_selection/images/sphx_glr_plot_rfe_with_cross_validation_001.png
   :alt: Recursive Feature Elimination  with correlated features
   :srcset: /auto_examples/feature_selection/images/sphx_glr_plot_rfe_with_cross_validation_001.png
   :class: sphx-glr-single-img





.. GENERATED FROM PYTHON SOURCE LINES 83-90

The plot above also shows a plateau of equivalent scores (similar mean values
and overlapping error bars) for 3 to 5 selected features. This is the result of
introducing correlated features. Indeed, the optimal model selected by RFE can
lie anywhere within this range, depending on the cross-validation technique.
The test accuracy decreases above 5 selected features; that is, keeping
non-informative features leads to over-fitting and is therefore detrimental to
the statistical performance of the models.
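
Since several feature counts score equivalently, one possible way to break the
tie is to prefer the smallest count whose mean score lies within one standard
deviation of the best mean score. The sketch below illustrates this heuristic.
It is an assumption added for illustration, not something RFECV does itself,
and the names ``threshold``, ``within`` and ``smallest`` are hypothetical. The
feature-count grid is built the same way as the x-axis of the plot above, which
is valid here because ``step=1``.

.. code-block:: python

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import RFECV
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold

    X, y = make_classification(
        n_samples=500,
        n_features=15,
        n_informative=3,
        n_redundant=2,
        n_repeated=0,
        n_classes=8,
        n_clusters_per_class=1,
        class_sep=0.8,
        random_state=0,
    )

    min_features_to_select = 1
    rfecv = RFECV(
        estimator=LogisticRegression(),
        step=1,
        cv=StratifiedKFold(5),
        scoring="accuracy",
        min_features_to_select=min_features_to_select,
    )
    rfecv.fit(X, y)

    means = np.asarray(rfecv.cv_results_["mean_test_score"])
    stds = np.asarray(rfecv.cv_results_["std_test_score"])
    # One entry per candidate number of features (valid because step=1).
    n_features_grid = np.arange(
        min_features_to_select, min_features_to_select + len(means)
    )

    # Illustrative heuristic: accept any count whose mean score is within one
    # standard deviation (across folds) of the best mean score.
    best = means.argmax()
    threshold = means[best] - stds[best]
    within = np.flatnonzero(means >= threshold)
    smallest = n_features_grid[within[0]]

    print(f"Best mean score at {n_features_grid[best]} features")
    print(f"Smallest count within one std of the best: {smallest}")

The result depends on the fold-to-fold variability, so a different
cross-validation splitter may lead to a different choice.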


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes  0.681 seconds)


.. _sphx_glr_download_auto_examples_feature_selection_plot_rfe_with_cross_validation.py:


.. only:: html

 .. container:: sphx-glr-footer
    :class: sphx-glr-footer-example



  .. container:: sphx-glr-download sphx-glr-download-python

     :download:`Download Python source code: plot_rfe_with_cross_validation.py <plot_rfe_with_cross_validation.py>`



  .. container:: sphx-glr-download sphx-glr-download-jupyter

     :download:`Download Jupyter notebook: plot_rfe_with_cross_validation.ipynb <plot_rfe_with_cross_validation.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_
