INTRODUCTION
In this blog, we will analyse the 'Heart' dataset to predict heart attack risk using logistic regression.
Importing Necessary Packages:
Reading the data:
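A minimal sketch of the loading step, assuming the data is a CSV file read with pandas. The column names and the three rows here are hypothetical placeholders so the snippet runs on its own; with the real file, pass its path to read_csv instead.

```python
import io
import pandas as pd

# Hypothetical short column names; the real file may use different headers.
columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]

# Placeholder rows (not real patient data) standing in for the CSV file.
sample_csv = io.StringIO(
    "52,1,4,130,210,0,0,150,0,1.0,2,0,3,0\n"
    "61,0,3,140,260,1,2,120,1,2.5,2,1,7,2\n"
    "45,1,2,120,198,0,0,170,0,0.0,1,0,3,0\n"
)

# With the real data, replace sample_csv with the path to the file.
df = pd.read_csv(sample_csv, header=None, names=columns)
print(df.shape)
print(df.head())
```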
Let us understand the Heart dataset. The dataset has 14 columns with the following information:
- Age: Age of the individual.
- Sex: Gender (1 = male, 0 = female).
- Chest-pain type: Type of chest pain (1 = typical angina, 2 = atypical, 3 = non-anginal, 4 = asymptomatic).
- Resting BP: Resting blood pressure in mmHg.
- Serum Cholesterol: Serum cholesterol in mg/dl.
- Fasting Blood Sugar: Fasting blood sugar > 120 mg/dl coded as 1, else 0.
- Resting ECG: Electrocardiographic results (0 = normal, 1 = ST-T wave abnormality, 2 = left ventricular hypertrophy).
- Max Heart Rate: Maximum heart rate achieved.
- Exercise Induced Angina: 1 = yes, 0 = no.
- ST Depression: Depression induced by exercise relative to rest.
- Peak Exercise ST Segment: 1 = upsloping, 2 = flat, 3 = downsloping.
- Vessels Colored: Number of major vessels (0–3) colored by fluoroscopy.
- Thal: Thalassemia (3 = normal, 6 = fixed defect, 7 = reversible defect).
- Diagnosis: Presence (1, 2, 3, 4) or absence (0) of heart disease.
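Since logistic regression is used here as a binary classifier, a diagnosis coded 0 to 4 as described above would need to be collapsed into presence versus absence. The sketch below shows one way to do that; the series name "diagnosis" and its values are illustrative assumptions, and the copy of the data you use may already ship with a 0/1 target.

```python
import pandas as pd

# Illustrative diagnosis codes: 0 = no disease, 1-4 = disease present.
diagnosis = pd.Series([0, 2, 1, 0, 4, 3], name="diagnosis")

# Any value above 0 becomes 1 (disease present); 0 stays 0.
target = (diagnosis > 0).astype(int)
print(target.tolist())  # [0, 1, 1, 0, 1, 1]
```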
DATA INSPECTION:
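A typical first inspection looks at column types, missing values and summary statistics. The sketch below uses a small placeholder frame with hypothetical column names rather than the actual Heart data.

```python
import numpy as np
import pandas as pd

# Placeholder rows with hypothetical column names, not the real data.
df = pd.DataFrame({
    "age": [52, 61, 45, 58],
    "chol": [210.0, np.nan, 198.0, 250.0],
    "target": [0, 1, 0, 1],
})

df.info()                    # column dtypes and non-null counts
missing = df.isnull().sum()  # number of missing values per column
summary = df.describe()      # count, mean, std, min, quartiles, max
print(missing)
print(summary)
```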
Exploratory Data Analysis:
A simple way to carry out EDA is with the pandas_profiling library, which generates a summary report of the dataset.
The pandas ProfileReport PDF is linked below.
The pandas profiling report shows that the dataset contains duplicate rows, so we remove them.
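Removing duplicates can be sketched with pandas as follows. The frame is a small placeholder with hypothetical columns; on the real data the same two calls apply to the full DataFrame.

```python
import pandas as pd

# Placeholder frame in which the first two rows are identical.
df = pd.DataFrame({
    "age": [63, 63, 41],
    "chol": [233, 233, 204],
    "target": [1, 1, 0],
})

# Count fully duplicated rows, then keep only the first occurrence of each.
n_duplicates = int(df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(df))
```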
The target classes in the dataset are fairly balanced.
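Class balance can be checked by looking at the share of each target value. The labels below are illustrative, not the actual class counts of the Heart data.

```python
import pandas as pd

# Illustrative labels: six positives and four negatives.
target = pd.Series([1, 0, 1, 1, 0, 0, 1, 0, 1, 1], name="target")

# normalize=True returns proportions instead of raw counts.
proportions = target.value_counts(normalize=True)
print(proportions)
```

Proportions close to 0.5 for each class suggest that accuracy is a meaningful metric; a heavy skew would call for other measures.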
MODEL BUILDING:
Creating the logistic regression model:
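A minimal sketch of fitting a logistic regression with scikit-learn. The synthetic data from make_classification stands in for the 13 Heart features, and the 80/20 split, the scaling step and the random_state values are assumptions made for illustration, not necessarily the settings used in this blog.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 13 predictor columns and the binary target.
X, y = make_classification(n_samples=300, n_features=13,
                           n_informative=6, random_state=42)

# Hold out 20% of the rows for testing, keeping the class ratio the same.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Scale the features, then fit the logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(y_pred[:10])
```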
Importing the evaluation metrics:
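The sketch below shows two common metrics from sklearn.metrics on a small set of hand-written labels. The labels are illustrative and are not the model's actual predictions.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Illustrative true labels and predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Fraction of predictions that match the true labels.
accuracy = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(accuracy)  # 0.75
print(cm)        # [[3 1]
                 #  [1 3]]
```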
The accuracy of the model is approximately 83%, which indicates a reasonably good model. A logistic regression model also gives its output as probabilities, available through predict_proba:
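To illustrate the probability output, the sketch below fits a logistic regression on synthetic data (an assumption, used only so the snippet runs on its own) and inspects predict_proba. Each row holds the probability of class 0 and class 1, and the two values sum to 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data standing in for the Heart features.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Column 0 is P(class 0), column 1 is P(class 1).
proba = clf.predict_proba(X)
positive_proba = proba[:, 1]

# Thresholding at 0.5 gives hard class labels from the probabilities.
labels_from_proba = (positive_proba > 0.5).astype(int)
print(proba[:5])
print(labels_from_proba[:5])
```

The 0.5 threshold can be moved if false negatives and false positives carry different costs, which is often the case in medical screening.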
ROC_AUC (Receiver Operating Characteristic and Area Under the Curve):
As a rule of thumb, a good model should reach a score of at least 0.8. A higher score means the model is better at distinguishing between the positive and negative classes.
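The score can be computed with roc_auc_score, which takes the true labels and the predicted probabilities of the positive class. The four labels and scores below are illustrative values, not output from the Heart model.

```python
from sklearn.metrics import roc_auc_score

# Illustrative true labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC is the share of (positive, negative) pairs ranked in the right order:
# here 3 of the 4 pairs are ordered correctly.
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```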
Plotting the ROC_AUC_CURVE:
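A minimal plotting sketch using roc_curve and matplotlib. The labels and scores are illustrative, the output file name roc_curve.png is hypothetical, and the Agg backend line is only there so the snippet runs without a display; in a notebook it can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

# Illustrative true labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# False positive rate and true positive rate at each threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

fig, ax = plt.subplots()
ax.plot(fpr, tpr, label=f"ROC curve (AUC = {auc:.2f})")
ax.plot([0, 1], [0, 1], linestyle="--", label="Random guess")
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.set_title("ROC curve")
ax.legend(loc="lower right")
fig.savefig("roc_curve.png")  # hypothetical output file name
```

The further the curve bows toward the top-left corner, away from the dashed diagonal, the better the model separates the two classes.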