Heart Disease Data

This is a multivariate dataset: each record carries several separate statistical variables. The full database includes 76 attributes, but all published studies use a subset of 14 of them: age, sex, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar, resting electrocardiographic results, maximum heart rate achieved, exercise-induced angina, oldpeak (ST depression induced by exercise relative to rest), the slope of the peak exercise ST segment, number of major vessels, thalassemia, and the diagnosis itself (the `num` column in the CSV). The Cleveland database is the only one used by ML researchers to date. There are two main tasks on this dataset: predicting from a patient's attributes whether that person has heart disease, and exploring the data to find insights that help in understanding the problem.

Hint: heart_disease_uci.csv


Instructions:
-------------
1. Use the Lifecycle of Data Science
2. Use the necessary data preprocessing techniques
3. Use various regression and classification techniques for comparison
4. Use metrics for regression and classification when needed.
5. Use various pipeline/hyperparameter tuning techniques to improve performance
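As a minimal sketch of instruction 5, the snippet below wraps scaling and a classifier in a scikit-learn `Pipeline` and tunes one hyperparameter with `GridSearchCV`. The synthetic data from `make_classification`, the choice of `LogisticRegression`, and the grid of `C` values are assumptions made for illustration; none of them come from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart disease features (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# Scaling lives inside the pipeline, so it is re-fit on each CV training fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid: only the regularization strength C is tuned
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

test_accuracy = search.score(X_te, y_te)
print("best params:", search.best_params_)
print("held-out accuracy:", test_accuracy)
```

Keeping the scaler inside the pipeline means the cross-validation folds never see statistics computed from their own validation rows.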

In [None]:
import numpy as np
import pandas as pd
import pandas_profiling as pp
import math
import random
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# preprocessing
import sklearn
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler, RobustScaler
from sklearn import metrics
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error, accuracy_score, confusion_matrix, explained_variance_score
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectFromModel, SelectKBest, RFE, chi2

# models
from sklearn.linear_model import LassoCV
from sklearn.svm import LinearSVC
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestClassifier

import warnings
warnings.filterwarnings("ignore")

In [None]:
random_state = 42

In [None]:
data = pd.read_csv("/content/heart_disease_uci.csv")

In [None]:
data.head(3)

Unnamed: 0,id,age,sex,dataset,cp,trestbps,chol,fbs,restecg,thalch,exang,oldpeak,slope,ca,thal,num
0,1,63,Male,Cleveland,typical angina,145.0,233.0,True,lv hypertrophy,150.0,False,2.3,downsloping,0.0,fixed defect,0
1,2,67,Male,Cleveland,asymptomatic,160.0,286.0,False,lv hypertrophy,108.0,True,1.5,flat,3.0,normal,2
2,3,67,Male,Cleveland,asymptomatic,120.0,229.0,False,lv hypertrophy,129.0,True,2.6,flat,2.0,reversable defect,1


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 920 entries, 0 to 919
Data columns (total 16 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   id        920 non-null    int64  
 1   age       920 non-null    int64  
 2   sex       920 non-null    object 
 3   dataset   920 non-null    object 
 4   cp        920 non-null    object 
 5   trestbps  861 non-null    float64
 6   chol      890 non-null    float64
 7   fbs       830 non-null    object 
 8   restecg   918 non-null    object 
 9   thalch    865 non-null    float64
 10  exang     865 non-null    object 
 11  oldpeak   858 non-null    float64
 12  slope     611 non-null    object 
 13  ca        309 non-null    float64
 14  thal      434 non-null    object 
 15  num       920 non-null    int64  
dtypes: float64(5), int64(3), object(8)
memory usage: 115.1+ KB


In [None]:
data['target'] = data['num']
data = data.drop(columns=['id', 'dataset', 'ca', 'thal', 'num'])

In [None]:
data = data[data['target'].isin([0, 1])]
data

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalch,exang,oldpeak,slope,target
0,63,Male,typical angina,145.0,233.0,True,lv hypertrophy,150.0,False,2.3,downsloping,0
2,67,Male,asymptomatic,120.0,229.0,False,lv hypertrophy,129.0,True,2.6,flat,1
3,37,Male,non-anginal,130.0,250.0,False,normal,187.0,False,3.5,downsloping,0
4,41,Female,atypical angina,130.0,204.0,False,lv hypertrophy,172.0,False,1.4,upsloping,0
5,56,Male,atypical angina,120.0,236.0,False,normal,178.0,False,0.8,upsloping,0
...,...,...,...,...,...,...,...,...,...,...,...,...
913,62,Male,asymptomatic,158.0,170.0,False,st-t abnormality,138.0,True,0.0,,1
915,54,Female,asymptomatic,127.0,333.0,True,st-t abnormality,154.0,False,0.0,,1
916,62,Male,typical angina,,139.0,False,st-t abnormality,,,,,0
918,58,Male,asymptomatic,,385.0,True,lv hypertrophy,,,,,0


In [None]:
data.describe([.05, .95])

Unnamed: 0,age,trestbps,chol,thalch,oldpeak,target
count,676.0,643.0,650.0,643.0,640.0,676.0
mean,51.715976,131.068429,214.946154,141.838258,0.645938,0.392012
std,9.276611,18.137884,99.125025,25.059654,0.900312,0.488561
min,28.0,80.0,0.0,69.0,-2.6,0.0
5%,36.0,105.0,0.0,98.0,0.0,0.0
50%,53.0,130.0,228.5,143.0,0.0,0.0
95%,66.0,160.0,338.55,180.0,2.305,1.0
max,76.0,200.0,603.0,202.0,5.0,1.0


In [None]:
data.describe([.01, .99])

Unnamed: 0,age,trestbps,chol,thalch,oldpeak,target
count,676.0,643.0,650.0,643.0,640.0,676.0
mean,51.715976,131.068429,214.946154,141.838258,0.645938,0.392012
std,9.276611,18.137884,99.125025,25.059654,0.900312,0.488561
min,28.0,80.0,0.0,69.0,-2.6,0.0
1%,31.75,95.42,0.0,86.42,-0.5,0.0
50%,53.0,130.0,228.5,143.0,0.0,0.0
99%,73.25,180.0,462.08,188.0,3.0,1.0
max,76.0,200.0,603.0,202.0,5.0,1.0


In [None]:
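# Drop outliers: keep chol <= 420 and 0 <= oldpeak <= 4.
# Rows with NaN in chol or oldpeak fail these comparisons and are dropped as well.
data = data[(data['chol'] <= 420) & (data['oldpeak'] >=0) & (data['oldpeak'] <=4)].reset_index(drop=True)
# Drop rows with a missing value in any remaining column
data = data.dropna().reset_index(drop=True)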
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 362 entries, 0 to 361
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       362 non-null    int64  
 1   sex       362 non-null    object 
 2   cp        362 non-null    object 
 3   trestbps  362 non-null    float64
 4   chol      362 non-null    float64
 5   fbs       362 non-null    object 
 6   restecg   362 non-null    object 
 7   thalch    362 non-null    float64
 8   exang     362 non-null    object 
 9   oldpeak   362 non-null    float64
 10  slope     362 non-null    object 
 11  target    362 non-null    int64  
dtypes: float64(4), int64(2), object(6)
memory usage: 34.1+ KB


In [None]:
data.describe()

Unnamed: 0,age,trestbps,chol,thalch,oldpeak,target
count,362.0,362.0,362.0,362.0,362.0,362.0
mean,53.049724,132.085635,230.179558,143.469613,0.984254,0.439227
std,8.803976,17.740539,78.592559,25.301187,0.898136,0.49698
min,29.0,92.0,0.0,82.0,0.0,0.0
25%,47.0,120.0,204.0,124.0,0.0,0.0
50%,54.0,130.0,235.5,147.0,1.0,0.0
75%,59.0,140.0,273.0,162.0,1.5,1.0
max,76.0,200.0,417.0,202.0,4.0,1.0


In [None]:
def str_features_to_numeric(data):
    """Label-encode every non-numeric column of the DataFrame (modified in place)."""

    # Identify the categorical (non-numeric) columns
    numerics = ['int8', 'int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    categorical_columns = [col for col in data.columns if data[col].dtype not in numerics]

    # Encode each categorical column as integer codes
    for col in categorical_columns:
        le = LabelEncoder()
        data[col] = le.fit_transform(data[col].astype(str))

    return data

In [None]:
# Transform all string features of the df to numeric features
data = str_features_to_numeric(data)
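To make the effect of the encoding step concrete, here is a small sketch on a hypothetical two-column frame (the `toy` data is an assumption, not taken from the CSV). `LabelEncoder` assigns integer codes in the sorted order of the string labels, so the resulting numbers carry no clinical ordering.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature frame with the same kinds of values as the real columns
toy = pd.DataFrame({"sex": ["Male", "Female", "Male"], "fbs": [True, False, True]})

encoder = LabelEncoder()
sex_codes = encoder.fit_transform(toy["sex"].astype(str))
fbs_codes = LabelEncoder().fit_transform(toy["fbs"].astype(str))

# Classes are sorted alphabetically: 'Female' -> 0, 'Male' -> 1
print(list(encoder.classes_), list(sex_codes))
# Booleans are cast to the strings 'False'/'True' first: 'False' -> 0, 'True' -> 1
print(list(fbs_codes))
```

For columns with more than two categories, such as `cp` or `restecg`, the same rule applies, so a linear model will treat the alphabetical codes as if they were ordered magnitudes.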

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 362 entries, 0 to 361
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       362 non-null    int64  
 1   sex       362 non-null    int64  
 2   cp        362 non-null    int64  
 3   trestbps  362 non-null    float64
 4   chol      362 non-null    float64
 5   fbs       362 non-null    int64  
 6   restecg   362 non-null    int64  
 7   thalch    362 non-null    float64
 8   exang     362 non-null    int64  
 9   oldpeak   362 non-null    float64
 10  slope     362 non-null    int64  
 11  target    362 non-null    int64  
dtypes: float64(4), int64(8)
memory usage: 34.1 KB


In [None]:
data.describe()


Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalch,exang,oldpeak,slope,target
count,362.0,362.0,362.0,362.0,362.0,362.0,362.0,362.0,362.0,362.0,362.0,362.0
mean,53.049724,0.698895,0.895028,132.085635,230.179558,0.132597,0.812155,143.469613,0.41989,0.984254,1.356354,0.439227
std,8.803976,0.459373,1.001399,17.740539,78.592559,0.339608,0.598304,25.301187,0.494224,0.898136,0.574216,0.49698
min,29.0,0.0,0.0,92.0,0.0,0.0,0.0,82.0,0.0,0.0,0.0,0.0
25%,47.0,0.0,0.0,120.0,204.0,0.0,0.0,124.0,0.0,0.0,1.0,0.0
50%,54.0,1.0,1.0,130.0,235.5,0.0,1.0,147.0,0.0,1.0,1.0,0.0
75%,59.0,1.0,2.0,140.0,273.0,0.0,1.0,162.0,1.0,1.5,2.0,1.0
max,76.0,1.0,3.0,200.0,417.0,1.0,2.0,202.0,1.0,4.0,2.0,1.0


In [None]:
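def fe_creation(df):
    # Bin the continuous features by floor division (e.g. age 63 -> age2 6)
    df['age2'] = df['age']//10
    df['trestbps2'] = df['trestbps']//10
    df['chol2'] = df['chol']//60
    df['thalch2'] = df['thalch']//40
    df['oldpeak2'] = df['oldpeak']//0.4
    # Cross each feature in the first list with each feature in the second list
    # by joining their values as strings (e.g. sex=1, cp=3 -> sex_cp "1_3")
    for i in ['sex', 'age2', 'fbs', 'restecg', 'exang']:
        for j in ['cp','trestbps2', 'chol2', 'thalch2', 'oldpeak2', 'slope']:
            df[i + "_" + j] = df[i].astype('str') + "_" + df[j].astype('str')
    return df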

data = fe_creation(data)
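The two ideas in `fe_creation`, binning by floor division and crossing two columns through string concatenation, can be traced on a tiny frame. The `toy` rows below reuse the values of rows 0 and 2 from the encoded data shown earlier; the frame itself is only an illustration.

```python
import pandas as pd

# Values taken from rows 0 and 2 of the encoded data, used here for illustration
toy = pd.DataFrame({"age": [63, 37], "sex": [1, 1], "cp": [3, 2], "oldpeak": [2.3, 3.5]})

# Binning: floor division maps a continuous value to the index of its bin
toy["age2"] = toy["age"] // 10          # decades
toy["oldpeak2"] = toy["oldpeak"] // 0.4  # bins of width 0.4

# Crossing: each distinct (sex, cp) pair becomes one string category
toy["sex_cp"] = toy["sex"].astype("str") + "_" + toy["cp"].astype("str")

print(toy[["age2", "oldpeak2", "sex_cp"]])
```

The crossed columns are strings at this point, which is why the notebook runs `str_features_to_numeric` a second time afterwards.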

In [None]:
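# Use the full option name; the short form 'max_columns' is ambiguous in recent pandas versions
pd.set_option('display.max_columns', len(data.columns)+1)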
len(data.columns)

47

In [None]:
# Transform all string features of the df to numeric features
data = str_features_to_numeric(data)
data.head(3)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalch,exang,oldpeak,slope,target,age2,trestbps2,chol2,thalch2,oldpeak2,sex_cp,sex_trestbps2,sex_chol2,sex_thalch2,sex_oldpeak2,sex_slope,age2_cp,age2_trestbps2,age2_chol2,age2_thalch2,age2_oldpeak2,age2_slope,fbs_cp,fbs_trestbps2,fbs_chol2,fbs_thalch2,fbs_oldpeak2,fbs_slope,restecg_cp,restecg_trestbps2,restecg_chol2,restecg_thalch2,restecg_oldpeak2,restecg_slope,exang_cp,exang_trestbps2,exang_chol2,exang_thalch2,exang_oldpeak2,exang_slope
0,63,1,3,145.0,233.0,1,0,150.0,0,2.3,0,0,6,14.0,3.0,3.0,5.0,7,14,9,4,11,3,16,30,22,11,30,10,7,16,9,5,15,3,3,4,2,1,5,0,3,4,2,1,5,0
1,67,1,0,120.0,229.0,0,0,129.0,1,2.6,1,1,6,12.0,3.0,3.0,6.0,4,12,9,4,12,4,13,28,22,11,31,11,0,2,3,1,6,1,0,2,2,1,6,1,4,12,9,5,15,4
2,37,1,2,130.0,250.0,0,1,187.0,0,3.5,0,0,3,13.0,4.0,4.0,8.0,6,13,10,5,14,3,3,4,5,3,8,1,2,3,4,2,8,0,6,12,10,6,15,3,2,3,3,2,8,0


In [None]:
data.shape

(362, 47)

In [None]:
train = data.copy()
target = train.pop('target')
train.head(2)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalch,exang,oldpeak,slope,age2,trestbps2,chol2,thalch2,oldpeak2,sex_cp,sex_trestbps2,sex_chol2,sex_thalch2,sex_oldpeak2,sex_slope,age2_cp,age2_trestbps2,age2_chol2,age2_thalch2,age2_oldpeak2,age2_slope,fbs_cp,fbs_trestbps2,fbs_chol2,fbs_thalch2,fbs_oldpeak2,fbs_slope,restecg_cp,restecg_trestbps2,restecg_chol2,restecg_thalch2,restecg_oldpeak2,restecg_slope,exang_cp,exang_trestbps2,exang_chol2,exang_thalch2,exang_oldpeak2,exang_slope
0,63,1,3,145.0,233.0,1,0,150.0,0,2.3,0,6,14.0,3.0,3.0,5.0,7,14,9,4,11,3,16,30,22,11,30,10,7,16,9,5,15,3,3,4,2,1,5,0,3,4,2,1,5,0
1,67,1,0,120.0,229.0,0,0,129.0,1,2.6,1,6,12.0,3.0,3.0,6.0,4,12,9,4,12,4,13,28,22,11,31,11,0,2,3,1,6,1,0,2,2,1,6,1,4,12,9,5,15,4


In [None]:
num_features_opt = 25   # the number of features that we need to choose as a result
num_features_max = 35   # the somewhat excessive number of features, which we will choose at each stage
features_best = []

In [None]:
# Threshold for removing correlated variables
threshold = 0.9

def highlight(value):
    if value > threshold:
        style = 'background-color: black'
    else:
        style = 'background-color: blue'
    return style

# Absolute value correlation matrix
corr_matrix = data.corr().abs().round(2)
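# Keep only the upper triangle (k=1 excludes the diagonal); np.bool was removed from NumPy, so use the builtin bool
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))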
upper.style.format("{:.2f}").applymap(highlight)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalch,exang,oldpeak,slope,target,age2,trestbps2,chol2,thalch2,oldpeak2,sex_cp,sex_trestbps2,sex_chol2,sex_thalch2,sex_oldpeak2,sex_slope,age2_cp,age2_trestbps2,age2_chol2,age2_thalch2,age2_oldpeak2,age2_slope,fbs_cp,fbs_trestbps2,fbs_chol2,fbs_thalch2,fbs_oldpeak2,fbs_slope,restecg_cp,restecg_trestbps2,restecg_chol2,restecg_thalch2,restecg_oldpeak2,restecg_slope,exang_cp,exang_trestbps2,exang_chol2,exang_thalch2,exang_oldpeak2,exang_slope
age,,0.09,0.0,0.24,0.01,0.18,0.01,0.31,0.03,0.05,0.06,0.04,0.95,0.23,0.01,0.3,0.05,0.09,0.02,0.09,0.23,0.05,0.12,0.92,0.92,0.93,0.91,0.92,0.92,0.14,0.24,0.16,0.03,0.18,0.13,0.01,0.07,0.01,0.11,0.03,0.01,0.04,0.09,0.04,0.08,0.04,0.01
sex,,,0.11,0.02,0.18,0.11,0.09,0.1,0.19,0.21,0.13,0.3,0.11,0.03,0.19,0.1,0.22,0.87,0.92,0.93,0.88,0.85,0.92,0.13,0.1,0.14,0.14,0.05,0.13,0.03,0.11,0.0,0.06,0.2,0.04,0.04,0.09,0.03,0.05,0.15,0.04,0.16,0.18,0.12,0.18,0.24,0.15
cp,,,,0.0,0.0,0.11,0.2,0.37,0.45,0.18,0.18,0.42,0.0,0.01,0.01,0.33,0.17,0.4,0.1,0.11,0.06,0.17,0.03,0.26,0.0,0.0,0.08,0.04,0.04,0.65,0.1,0.09,0.25,0.01,0.2,0.22,0.18,0.2,0.09,0.23,0.14,0.06,0.4,0.44,0.38,0.43,0.41
trestbps,,,,,0.14,0.09,0.08,0.09,0.14,0.14,0.04,0.16,0.21,0.99,0.13,0.1,0.14,0.02,0.32,0.07,0.03,0.09,0.0,0.2,0.36,0.23,0.19,0.23,0.2,0.06,0.41,0.14,0.03,0.14,0.05,0.08,0.33,0.12,0.04,0.12,0.07,0.15,0.4,0.18,0.11,0.16,0.13
chol,,,,,,0.0,0.17,0.14,0.02,0.06,0.1,0.07,0.0,0.15,0.97,0.11,0.04,0.16,0.11,0.2,0.11,0.11,0.13,0.0,0.01,0.18,0.03,0.0,0.02,0.0,0.05,0.49,0.05,0.02,0.05,0.17,0.12,0.12,0.13,0.13,0.14,0.02,0.03,0.3,0.02,0.0,0.02
fbs,,,,,,,0.03,0.02,0.02,0.06,0.1,0.02,0.19,0.08,0.01,0.02,0.06,0.16,0.12,0.12,0.1,0.12,0.07,0.21,0.18,0.19,0.19,0.19,0.16,0.83,0.9,0.86,0.89,0.87,0.86,0.02,0.0,0.03,0.04,0.0,0.06,0.04,0.0,0.01,0.03,0.01,0.06
restecg,,,,,,,,0.29,0.24,0.2,0.15,0.18,0.01,0.1,0.17,0.2,0.19,0.02,0.13,0.02,0.02,0.16,0.02,0.05,0.04,0.02,0.04,0.05,0.02,0.13,0.04,0.11,0.12,0.07,0.1,0.91,0.95,0.96,0.94,0.94,0.95,0.15,0.26,0.18,0.19,0.26,0.19
thalch,,,,,,,,,0.5,0.35,0.4,0.44,0.28,0.11,0.14,0.9,0.33,0.09,0.13,0.05,0.33,0.25,0.06,0.17,0.26,0.25,0.05,0.34,0.19,0.22,0.02,0.09,0.43,0.15,0.22,0.13,0.29,0.24,0.01,0.37,0.16,0.35,0.48,0.44,0.21,0.53,0.37
exang,,,,,,,,,,0.36,0.37,0.47,0.03,0.15,0.03,0.46,0.35,0.05,0.25,0.18,0.04,0.33,0.04,0.09,0.05,0.03,0.09,0.11,0.05,0.27,0.06,0.03,0.23,0.16,0.21,0.05,0.28,0.23,0.09,0.33,0.12,0.86,0.94,0.94,0.94,0.94,0.92
oldpeak,,,,,,,,,,,0.48,0.35,0.05,0.17,0.04,0.33,0.99,0.1,0.25,0.23,0.04,0.69,0.01,0.01,0.06,0.05,0.04,0.28,0.05,0.05,0.12,0.08,0.1,0.54,0.19,0.13,0.23,0.21,0.09,0.52,0.05,0.3,0.37,0.36,0.27,0.66,0.19


In [None]:
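# Drop every column whose correlation with an earlier column exceeds the threshold.
# Note: `data` still contains `target` here, so `target` ends up in this feature list.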
collinear_features = [column for column in upper.columns if any(upper[column] > threshold)]
features_filtered = data.drop(columns = collinear_features)
print('The number of features that passed the collinearity threshold: ', features_filtered.shape[1])
features_best.append(features_filtered.columns.tolist())

The number of features that passed the collinearity threshold:  23
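The upper-triangle trick used for the collinearity filter is easier to follow on three columns. In this sketch the `toy` frame is hypothetical: `b` is exactly twice `a`, while `c` is only weakly related to either.

```python
import numpy as np
import pandas as pd

# Hypothetical data: b duplicates the information in a, c does not
toy = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [2, 4, 6, 8, 10], "c": [5, 1, 4, 2, 3]})

corr = toy.corr().abs()
# Keep only entries above the diagonal so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

# A column is dropped if it correlates above 0.9 with any column to its left
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
kept = toy.drop(columns=to_drop).columns.tolist()

print("dropped:", to_drop)
print("kept:", kept)
```

Because only the upper triangle is checked, the first column of a correlated pair survives and the later one is removed, so the result depends on column order.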


In [None]:
lsvc = LinearSVC(C=0.1, penalty="l1", dual=False).fit(train, target)
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(train)
X_selected_df = pd.DataFrame(X_new, columns=[train.columns[i] for i in range(len(train.columns)) if model.get_support()[i]])
features_best.append(X_selected_df.columns.tolist())

In [None]:
lasso = LassoCV(cv=3).fit(train, target)
model = SelectFromModel(lasso, prefit=True)
X_new = model.transform(train)
X_selected_df = pd.DataFrame(X_new, columns=[train.columns[i] for i in range(len(train.columns)) if model.get_support()[i]])
features_best.append(X_selected_df.columns.tolist())

In [None]:
# Adapted from https://towardsdatascience.com/feature-selection-techniques-in-machine-learning-with-python-f24e7da3f36e,
# but with k='all' so that every feature receives a chi-squared score
bestfeatures = SelectKBest(score_func=chi2, k='all')
fit = bestfeatures.fit(train, target)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(train.columns)

#concat two dataframes for better visualization 
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Feature','Score']  #naming the dataframe columns
features_best.append(featureScores.nlargest(num_features_max,'Score')['Feature'].tolist())
print(featureScores.nlargest(len(dfcolumns),'Score')) 

              Feature       Score
44     exang_oldpeak2  465.998249
41    exang_trestbps2  334.796553
7              thalch  313.526041
42        exang_chol2  157.634247
20       sex_oldpeak2  132.273799
17      sex_trestbps2  103.586280
38   restecg_oldpeak2   92.646747
15           oldpeak2   80.815162
2                  cp   72.968423
35  restecg_trestbps2   57.716217
4                chol   53.672855
43      exang_thalch2   51.270836
28             fbs_cp   51.166397
8               exang   45.422026
32       fbs_oldpeak2   40.018109
18          sex_chol2   38.775596
9             oldpeak   35.483990
40           exang_cp   34.227582
45        exang_slope   33.239711
3            trestbps   21.515854
31        fbs_thalch2   20.843879
36      restecg_chol2   16.770899
26      age2_oldpeak2   11.360552
10              slope   10.515031
1                 sex    9.930103
33          fbs_slope    9.679364
14            thalch2    9.434070
22            age2_cp    5.301202
21          se
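For intuition about the scores above, the sketch below runs scikit-learn's `chi2` on a hand-sized example. The arrays `X_toy` and `y_toy` are hypothetical. The function treats each feature as a non-negative count and compares the per-class sums with what would be expected if the feature were independent of the class.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Hypothetical data: feature 0 is concentrated in class 1, feature 1 is constant
X_toy = np.array([[3, 1], [4, 1], [0, 1], [1, 1]])
y_toy = np.array([1, 1, 0, 0])

scores, p_values = chi2(X_toy, y_toy)

# Feature 0: observed class sums (1, 7) vs expected (4, 4) -> 9/4 + 9/4 = 4.5
# Feature 1: observed (2, 2) vs expected (2, 2) -> 0.0
print(scores)
```

Since the statistic is computed from sums of feature values, it is sensitive to scale and to the arbitrary integer codes produced by label encoding, which is worth keeping in mind when reading the ranking.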

In [None]:
rfe_selector = RFE(estimator=LogisticRegression(), n_features_to_select=num_features_max, step=10, verbose=5)
rfe_selector.fit(train, target)
rfe_support = rfe_selector.get_support()
rfe_feature = train.loc[:,rfe_support].columns.tolist()
print(str(len(rfe_feature)), 'selected features')
# Note: rfe_feature is not appended to features_best, so RFE takes no part in the vote below

Fitting estimator with 46 features.
Fitting estimator with 36 features.
35 selected features


In [None]:
embeded_rf_selector = SelectFromModel(RandomForestClassifier(n_estimators=200), threshold='1.25*median')
embeded_rf_selector.fit(train, target)

SelectFromModel(estimator=RandomForestClassifier(n_estimators=200),
                threshold='1.25*median')

In [None]:
embeded_rf_support = embeded_rf_selector.get_support()
embeded_rf_feature = train.loc[:,embeded_rf_support].columns.tolist()
print(str(len(embeded_rf_feature)), 'selected features')
# Note: embeded_rf_feature is not appended to features_best, so it takes no part in the vote below

16 selected features


In [None]:
# Keep only the features whose variance exceeds the threshold (near-constant features are dropped)
selector = VarianceThreshold(threshold=10)
np.shape(selector.fit_transform(data))
features_best.append(list(np.array(data.columns)[selector.get_support(indices=False)]))
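A short sketch shows what `VarianceThreshold(threshold=10)` does on a hypothetical two-column array (`X_toy` is an assumption for illustration). The selector computes the population variance of each column and keeps only those above the threshold.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical data: column 0 spans a wide range, column 1 is a 0/1 flag
X_toy = np.array([[0, 0], [10, 1], [20, 0], [30, 1]], dtype=float)

selector_toy = VarianceThreshold(threshold=10)
selector_toy.fit(X_toy)

print("variances:", selector_toy.variances_)   # 125.0 and 0.25
print("kept mask:", selector_toy.get_support())
```

Variance depends on the unit of each column, so with a threshold of 10 any binary or small-integer feature is removed regardless of how informative it is. That is why only wide-ranging columns such as `trestbps`, `chol`, and `thalch` can pass this particular filter unscaled.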

In [None]:
features_best

[['age',
  'sex',
  'cp',
  'trestbps',
  'chol',
  'fbs',
  'restecg',
  'thalch',
  'exang',
  'oldpeak',
  'slope',
  'target',
  'thalch2',
  'sex_cp',
  'sex_thalch2',
  'sex_oldpeak2',
  'fbs_cp',
  'fbs_trestbps2',
  'fbs_chol2',
  'fbs_thalch2',
  'fbs_oldpeak2',
  'fbs_slope',
  'exang_cp'],
 ['age',
  'cp',
  'trestbps',
  'chol',
  'thalch',
  'slope',
  'sex_chol2',
  'sex_oldpeak2',
  'age2_cp',
  'age2_trestbps2',
  'age2_slope',
  'fbs_cp',
  'restecg_thalch2',
  'exang_chol2',
  'exang_oldpeak2'],
 ['age',
  'cp',
  'trestbps',
  'chol',
  'thalch',
  'slope',
  'sex_chol2',
  'sex_oldpeak2',
  'age2_cp',
  'age2_trestbps2',
  'fbs_slope',
  'restecg_slope',
  'exang_chol2',
  'exang_oldpeak2'],
 ['exang_oldpeak2',
  'exang_trestbps2',
  'thalch',
  'exang_chol2',
  'sex_oldpeak2',
  'sex_trestbps2',
  'restecg_oldpeak2',
  'oldpeak2',
  'cp',
  'restecg_trestbps2',
  'chol',
  'exang_thalch2',
  'fbs_cp',
  'exang',
  'fbs_oldpeak2',
  'sex_chol2',
  'oldpeak',
  'exan

In [None]:
# The element is in at least one list of optimal features
main_cols_max = features_best[0]
for i in range(len(features_best)-1):
    main_cols_max = list(set(main_cols_max) | set(features_best[i+1]))
main_cols_max

['age2_thalch2',
 'trestbps2',
 'sex_chol2',
 'thalch',
 'sex_slope',
 'fbs',
 'exang_chol2',
 'sex',
 'restecg_trestbps2',
 'fbs_cp',
 'fbs_slope',
 'restecg_chol2',
 'exang_thalch2',
 'age',
 'fbs_oldpeak2',
 'fbs_thalch2',
 'age2_chol2',
 'target',
 'sex_oldpeak2',
 'age2_oldpeak2',
 'age2_cp',
 'restecg',
 'cp',
 'exang_trestbps2',
 'exang_slope',
 'exang',
 'age2_slope',
 'exang_cp',
 'exang_oldpeak2',
 'restecg_thalch2',
 'fbs_chol2',
 'slope',
 'sex_thalch2',
 'chol',
 'fbs_trestbps2',
 'restecg_slope',
 'thalch2',
 'trestbps',
 'restecg_oldpeak2',
 'sex_trestbps2',
 'oldpeak2',
 'sex_cp',
 'oldpeak',
 'age2_trestbps2']

In [None]:
len(main_cols_max)

44

In [None]:
# The element is in all lists of optimal features
main_cols_min = features_best[0]
for i in range(len(features_best)-1):
    main_cols_min = list(set(main_cols_min).intersection(set(features_best[i+1])))
main_cols_min

['chol', 'thalch', 'sex_oldpeak2', 'trestbps']

In [None]:
# Most common items in all lists of optimal features
main_cols = []
main_cols_opt = {feature_name : 0 for feature_name in data.columns.tolist()}
for i in range(len(features_best)):
    for feature_name in features_best[i]:
        main_cols_opt[feature_name] += 1
df_main_cols_opt = pd.DataFrame.from_dict(main_cols_opt, orient='index', columns=['Num'])
df_main_cols_opt.sort_values(by=['Num'], ascending=False).head(num_features_opt)

Unnamed: 0,Num
trestbps,5
chol,5
thalch,5
sex_oldpeak2,5
age,4
slope,4
exang_oldpeak2,4
exang_chol2,4
age2_trestbps2,4
sex_chol2,4
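The three ways of combining the selector outputs (union, intersection, and vote count) can be sketched with plain Python sets and `collections.Counter`. The `lists_demo` contents are hypothetical stand-ins for `features_best`.

```python
from collections import Counter

# Hypothetical outputs of three feature selectors
lists_demo = [
    ["chol", "thalch", "age"],
    ["chol", "cp"],
    ["chol", "thalch", "slope"],
]

# Union: features chosen by at least one selector
union = set().union(*lists_demo)
# Intersection: features chosen by every selector
common = set(lists_demo[0]).intersection(*lists_demo[1:])
# Votes: how many selectors chose each feature
votes = Counter(name for selected in lists_demo for name in selected)
top_two = [name for name, _ in votes.most_common(2)]

print(sorted(union), sorted(common), top_two)
```

Ranking by votes sits between the two extremes: the union keeps almost everything, the intersection keeps very little, and a vote cutoff lets the number of retained features be chosen directly.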


In [None]:
main_cols = df_main_cols_opt.nlargest(num_features_opt, 'Num').index.tolist()
if 'target' not in main_cols:
    main_cols.append('target')
main_cols

['trestbps',
 'chol',
 'thalch',
 'sex_oldpeak2',
 'age',
 'cp',
 'slope',
 'sex_chol2',
 'age2_cp',
 'age2_trestbps2',
 'exang_chol2',
 'exang_oldpeak2',
 'fbs_cp',
 'fbs_trestbps2',
 'fbs_oldpeak2',
 'fbs_slope',
 'sex',
 'restecg',
 'exang',
 'oldpeak',
 'thalch2',
 'sex_trestbps2',
 'sex_thalch2',
 'age2_oldpeak2',
 'fbs_thalch2',
 'target']