Stroke Prediction using the “Stroke Prediction” dataset
Question
I want to do stroke prediction using the “Stroke Prediction” dataset. I want to preprocess the data, balance it, and use 80% of the data for training and 20% for testing, ensuring that both sets are balanced. I want to train the model using the Firefly Evolutionary Algorithm. Show the confusion matrices for the Firefly Algorithm (for training data and test data). Show one graph plotting the average accuracy (on the y-axis) against log(number of weight updates) for training data and testing data, and show the accuracy.
How can I write code for this in Python?
Answer
To preprocess the data, balance the data and split it into training and testing datasets in Python, you can follow these steps:
import pandas as pd
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
# Load the dataset
df = pd.read_csv("healthcare-dataset-stroke-data.csv")
# Drop the id column as it is not needed for training the model
df.drop("id", axis=1, inplace=True)
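# Check for missing values and replace them with the mean value for the column
# (numeric_only=True skips the text columns, which have no mean)
df.fillna(df.mean(numeric_only=True), inplace=True)
# Convert categorical variables to numerical using one-hot encoding
# (dtype=float keeps the dummy columns numeric instead of boolean)
df = pd.get_dummies(df, columns=["gender", "ever_married", "work_type", "Residence_type", "smoking_status"], dtype=float)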
# Split the data into minority and majority classes
stroke_df = df[df["stroke"] == 1]
non_stroke_df = df[df["stroke"] == 0]
# Upsample the minority class to balance the data (random_state makes the resampling reproducible)
stroke_df_upsampled = resample(stroke_df, replace=True, n_samples=len(non_stroke_df), random_state=42)
# Combine the minority and majority classes to create a balanced dataset
balanced_df = pd.concat([stroke_df_upsampled, non_stroke_df])
# Split the data into training and testing sets with an 80-20 split
X = balanced_df.drop("stroke", axis=1)
y = balanced_df["stroke"]
# stratify=y keeps the class ratio the same in both sets, so both stay balanced
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
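One caveat with upsampling before the split is that copies of the same stroke row can end up in both the training and the test set, which tends to inflate test accuracy. A possible alternative is to split first and then balance each set on its own. The following is only a minimal sketch of that idea: it uses synthetic labels in place of the `stroke` column, and the helper names `stratified_split` and `oversample` are illustrative, not taken from any library.

```python
import numpy as np

def stratified_split(y, test_size=0.2, seed=0):
    """Return train/test index arrays with the class ratio preserved in both."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        n_test = int(round(test_size * len(idx)))
        test_idx.append(idx[:n_test])
        train_idx.append(idx[n_test:])
    return np.concatenate(train_idx), np.concatenate(test_idx)

def oversample(idx, y, seed=0):
    """Resample minority-class indices with replacement until the classes match."""
    rng = np.random.default_rng(seed)
    labels, counts = np.unique(y[idx], return_counts=True)
    target = counts.max()
    parts = []
    for label, count in zip(labels, counts):
        members = idx[y[idx] == label]
        extra = rng.choice(members, size=target - count, replace=True)
        parts.append(np.concatenate([members, extra]))
    return np.concatenate(parts)

# Synthetic, imbalanced labels standing in for the "stroke" column
y_demo = np.array([1] * 50 + [0] * 950)
train_idx, test_idx = stratified_split(y_demo)

# Balance each set separately so that no row appears in both sets
train_bal = oversample(train_idx, y_demo, seed=1)
test_bal = oversample(test_idx, y_demo, seed=2)

print(np.bincount(y_demo[train_bal]), np.bincount(y_demo[test_bal]))
```

With real data, the balanced index arrays would be used to select rows from the feature table, for example with `df.iloc[train_bal]`.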
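To train the model with the Firefly Algorithm, let each firefly's position be one full set of network weights and let its brightness be the training accuracy those weights achieve. As far as I know, pygmo does not include a firefly algorithm, so the update rule is written directly in NumPy. Here’s an example code snippet that uses the Firefly Algorithm to train a simple neural network.
import numpy as np
from sklearn.metrics import confusion_matrix
# Build a network from a flat weight vector x and predict 0/1 labels for X
def predict_labels(x, X):
    model = NeuralNetwork(n_inputs, n_hidden, n_outputs)
    split = n_inputs * n_hidden
    model.weights1 = x[:split].reshape(n_inputs, n_hidden)
    model.weights2 = x[split:].reshape(n_hidden, n_outputs)
    return np.ravel(model.predict(X)).astype(int)
# Define the fitness function for the Firefly Algorithm (lower is better).
# model.fit is not called here: the firefly positions are the weights.
def fitness(x):
    return -accuracy(y_train, predict_labels(x, X_train))
# Define the Firefly Algorithm
class FireflyAlgorithm:
    def __init__(self, population_size, max_generations, beta0=1.0, gamma=0.01, alpha=0.1):
        self.population_size = population_size
        self.max_generations = max_generations
        # Illustrative defaults, not tuned for this dataset
        self.beta0 = beta0  # attractiveness at distance zero
        self.gamma = gamma  # light absorption coefficient
        self.alpha = alpha  # scale of the random step
    def evolve(self):
        rng = np.random.default_rng(42)
        dim = n_inputs * n_hidden + n_hidden * n_outputs
        # Create the initial population
        population = rng.standard_normal((self.population_size, dim))
        cost = np.array([fitness(x) for x in population])
        best_x = population[np.argmin(cost)]
        train_acc, test_acc = [], []
        # Evolve the population
        for generation in range(self.max_generations):
            moved = population.copy()
            for i in range(self.population_size):
                # Move firefly i toward every brighter (lower-cost) firefly j
                for j in range(self.population_size):
                    if cost[j] < cost[i]:
                        r2 = np.sum((population[i] - population[j]) ** 2)
                        beta = self.beta0 * np.exp(-self.gamma * r2)
                        moved[i] += beta * (population[j] - moved[i])
                moved[i] += self.alpha * rng.standard_normal(dim)
            # Keep the current best firefly unchanged so it is never lost
            best = np.argmin(cost)
            moved[best] = population[best]
            population = moved
            cost = np.array([fitness(x) for x in population])
            # Get the best individual and record its accuracy for this generation
            best_x = population[np.argmin(cost)]
            train_acc.append(accuracy(y_train, predict_labels(best_x, X_train)))
            test_acc.append(accuracy(y_test, predict_labels(best_x, X_test)))
        # Predict the labels for the training and testing data
        y_train_pred = predict_labels(best_x, X_train)
        y_test_pred = predict_labels(best_x, X_test)
        # Calculate the confusion matrices for the training and testing data
        train_cm = confusion_matrix(y_train, y_train_pred)
        test_cm = confusion_matrix(y_test, y_test_pred)
        # train_acc and test_acc hold one accuracy value per generation
        return train_cm, test_cm, train_acc, test_acc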
# Define the Neural Network class
class NeuralNetwork:
    def __init__(self, n_inputs, n_hidden, n_outputs):
        self.n_inputs = n_inputs
        self.n_hidden = n_hidden
        self.n_outputs = n_outputs
        # Initialize the weights randomly
        self.weights1 = np.random.randn(self.n_inputs, self.n_hidden)
        self.weights2 = np.random.randn(self.n_hidden, self.n_outputs)
    def forward(self, X):
        # Accept DataFrames or arrays and always work on a 2-D float array
        X = np.atleast_2d(np.asarray(X, dtype=float))
        # Calculate the output of the hidden layer
        self.hidden = np.dot(X, self.weights1)
        self.hidden_activation = sigmoid(self.hidden)
        # Calculate the output of the output layer
        self.output = np.dot(self.hidden_activation, self.weights2)
        self.output_activation = sigmoid(self.output)
        return self.output_activation
    def backward(self, X, y, output):
        # Calculate the error in the output
        self.output_error = (y - output) * sigmoid_derivative(output)
        # Calculate the error in the hidden layer
        self.hidden_error = np.dot(self.output_error, self.weights2.T) * sigmoid_derivative(self.hidden_activation)
        # Update the weights
        self.weights2 += learning_rate * np.dot(self.hidden_activation.T, self.output_error)
        self.weights1 += learning_rate * np.dot(X.T, self.hidden_error)
    def fit(self, X, y):
        # Convert to arrays so that positional indexing also works for DataFrames
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float).reshape(-1, 1)
        for i in range(num_epochs):
            for j in range(len(X)):
                # Forward pass on one sample, kept 2-D as a (1, n_inputs) row
                output = self.forward(X[j:j + 1])
                # Backward pass
                self.backward(X[j:j + 1], y[j:j + 1], output)
    def predict(self, X):
        # Predict 0/1 labels for the given input as a flat integer array
        return np.round(self.forward(X)).ravel().astype(int)
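# Define the sigmoid and sigmoid_derivative functions
def sigmoid(x):
    # Clip the input so that np.exp cannot overflow on unscaled features
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))
def sigmoid_derivative(x):
    # x is expected to be a sigmoid output already
    return x * (1 - x)
# Define the accuracy function
def accuracy(y_true, y_pred):
    # Flatten both inputs so that a column vector compares element by element
    return np.mean(np.ravel(y_true) == np.ravel(y_pred))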
# Define the parameters for the Firefly Algorithm
population_size = 50
max_generations = 100
# Define the parameters for the Neural Network
n_inputs = X_train.shape[1]
n_hidden = 4
n_outputs = 1
learning_rate = 0.1
num_epochs = 100
# Train the model using Firefly Algorithm
fa = FireflyAlgorithm(population_size, max_generations)
train_cm, test_cm, train_acc, test_acc = fa.evolve()
# Plot the accuracy graph
import matplotlib.pyplot as plt
plt.plot(np.log(range(1, max_generations+1)), train_acc, label="Train Accuracy")
plt.plot(np.log(range(1, max_generations+1)), test_acc, label="Test Accuracy")
plt.xlabel("Log(Number of Weight Updates)")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
# Print the confusion matrices
print("Confusion Matrix for Training Data:")
print(train_cm)
print("Confusion Matrix for Testing Data:")
print(test_cm)
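The firefly update itself is easier to follow on a toy problem than inside the neural-network code. The sketch below is a minimal, illustrative version that minimizes a simple sum-of-squares function: each firefly moves toward every brighter one with attractiveness `beta0 * exp(-gamma * r^2)` plus a small random step. The function name `firefly_minimize`, the parameter values, and the choice to keep the best firefly unchanged each generation are all assumptions made for this example.

```python
import numpy as np

def firefly_minimize(cost_fn, dim, n_fireflies=15, n_generations=50,
                     beta0=1.0, gamma=0.1, alpha=0.05, seed=0):
    """Minimize cost_fn with a basic firefly search.

    Returns the best position found and the best cost after each generation.
    """
    rng = np.random.default_rng(seed)
    pos = rng.uniform(-2.0, 2.0, size=(n_fireflies, dim))
    cost = np.array([cost_fn(p) for p in pos])
    history = [cost.min()]
    for _ in range(n_generations):
        new_pos = pos.copy()
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if cost[j] < cost[i]:  # firefly j is brighter than firefly i
                    r2 = np.sum((pos[i] - pos[j]) ** 2)
                    beta = beta0 * np.exp(-gamma * r2)
                    new_pos[i] += beta * (pos[j] - new_pos[i])
            # Small random step to keep exploring
            new_pos[i] += alpha * rng.standard_normal(dim)
        # Keep the current best firefly as it is, so the best cost never gets worse
        best = np.argmin(cost)
        new_pos[best] = pos[best]
        pos = new_pos
        cost = np.array([cost_fn(p) for p in pos])
        history.append(cost.min())
    return pos[np.argmin(cost)], history

# Toy objective: sum of squares, with its minimum at the origin
best_pos, history = firefly_minimize(lambda p: float(np.sum(p ** 2)), dim=5)
print("initial best cost:", history[0])
print("final best cost:", history[-1])
```

For the stroke model, the cost function would be the negative training accuracy of the network decoded from the position vector.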
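The complete code for your stroke prediction task is below. It reuses the NeuralNetwork class and the sigmoid, sigmoid_derivative and accuracy helpers defined above, and it balances the data by undersampling the majority class instead of oversampling the minority class: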
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
# Load the dataset
data = pd.read_csv("healthcare-dataset-stroke-data.csv")
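# Preprocess the data: drop the identifier column and the rows with missing values
data = data.drop("id", axis=1).dropna()
# dtype=float keeps the dummy columns numeric instead of boolean
data = pd.get_dummies(data, columns=["gender", "ever_married", "work_type", "Residence_type", "smoking_status"], dtype=float)
X = data.drop(["stroke"], axis=1)
y = data["stroke"]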
# Balance the data
X_pos = X[y == 1]
X_neg = X[y == 0].sample(n=len(X_pos), random_state=42)
y_pos = y[y == 1]
y_neg = y[y == 0].sample(n=len(y_pos), random_state=42)
X = pd.concat([X_pos, X_neg])
y = pd.concat([y_pos, y_neg])
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
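# Normalize the data using statistics computed on the raw training set only
train_mean = X_train.mean()
train_std = X_train.std().replace(0, 1)  # avoid dividing by zero for constant columns
X_train = (X_train - train_mean) / train_std
X_test = (X_test - train_mean) / train_std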
# Define the Firefly Algorithm class
class FireflyAlgorithm:
    def __init__(self, population_size, max_generations, beta0=1.0, gamma=0.01, alpha=0.1):
        self.population_size = population_size
        self.max_generations = max_generations
        # Firefly parameters (illustrative defaults, not tuned for this dataset)
        self.beta0 = beta0  # attractiveness at distance zero
        self.gamma = gamma  # light absorption coefficient
        self.alpha = alpha  # scale of the random step
        # Define the parameters for the Neural Network
        self.n_inputs = X_train.shape[1]
        self.n_hidden = 4
        self.n_outputs = 1
        self.dim = self.n_inputs * self.n_hidden + self.n_hidden * self.n_outputs
    def decode(self, position):
        # Decode the weights from the chromosome
        nn = NeuralNetwork(self.n_inputs, self.n_hidden, self.n_outputs)
        split = self.n_inputs * self.n_hidden
        nn.weights1 = np.reshape(position[:split], (self.n_inputs, self.n_hidden))
        nn.weights2 = np.reshape(position[split:], (self.n_hidden, self.n_outputs))
        return nn
    def brightness(self, position):
        # Brightness is the training accuracy; the test set is never used for selection
        return accuracy(y_train, np.ravel(self.decode(position).predict(X_train)))
    def evolve(self):
        rng = np.random.default_rng(42)
        # Initialize the population
        population = rng.standard_normal((self.population_size, self.dim))
        # Evaluate the fitness of each firefly
        fitness = np.array([self.brightness(p) for p in population])
        nn = self.decode(population[np.argmax(fitness)])
        self.train_accs, self.test_accs = [], []
        for generation in range(self.max_generations):
            # Create the next generation
            new_population = population.copy()
            for i in range(self.population_size):
                # Move towards every brighter firefly
                for j in range(self.population_size):
                    if fitness[j] > fitness[i]:
                        distance = np.linalg.norm(population[i] - population[j])
                        attractiveness = self.beta0 * np.exp(-self.gamma * distance ** 2)
                        new_population[i] += attractiveness * (population[j] - new_population[i])
                # Add a small random step
                new_population[i] += self.alpha * rng.standard_normal(self.dim)
            # Keep the brightest firefly unchanged so the best solution is never lost
            best_index = np.argmax(fitness)
            new_population[best_index] = population[best_index]
            population = new_population
            fitness = np.array([self.brightness(p) for p in population])
            # Select the best firefly and record its accuracy for this generation
            nn = self.decode(population[np.argmax(fitness)])
            self.train_accs.append(accuracy(y_train, np.ravel(nn.predict(X_train))))
            self.test_accs.append(accuracy(y_test, np.ravel(nn.predict(X_test))))
        # Calculate the predictions and the confusion matrices
        y_pred_train = np.ravel(nn.predict(X_train)).astype(int)
        train_cm = confusion_matrix(y_train, y_pred_train)
        y_pred_test = np.ravel(nn.predict(X_test)).astype(int)
        test_cm = confusion_matrix(y_test, y_pred_test)
        return train_cm, test_cm, nn
# Create the Firefly Algorithm object and evolve the population
fa = FireflyAlgorithm(population_size=50, max_generations=10)
train_cm, test_cm, nn = fa.evolve()
# Print the confusion matrices
print("Confusion Matrix for Training Data:")
print(train_cm)
print("Confusion Matrix for Testing Data:")
print(test_cm)
# Plot the accuracy of the best firefly against the log of the number of weight updates.
# Each generation moves every firefly once, so it is counted here as population_size updates.
updates = fa.population_size * np.arange(1, len(fa.train_accs) + 1)
plt.plot(np.log(updates), fa.train_accs, label="Training Accuracy")
plt.plot(np.log(updates), fa.test_accs, label="Testing Accuracy")
plt.xlabel("log(number of weight updates)")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
# Show the final accuracy
print("Training accuracy:", fa.train_accs[-1])
print("Testing accuracy:", fa.test_accs[-1])
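The normalization step only works as intended when the test set is scaled with the mean and standard deviation of the raw training data, and when constant columns are guarded against a zero standard deviation. Below is a small sketch of that pattern on synthetic arrays; the `standardize` helper is a hypothetical name used for this illustration only.

```python
import numpy as np

def standardize(train, test):
    """Scale both arrays with the mean and std of the training array only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # leave constant columns unscaled
    return (train - mean) / std, (test - mean) / std

rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
test = rng.normal(loc=50.0, scale=10.0, size=(50, 3))
# A constant column, such as a dummy variable that never varies after undersampling
train[:, 2] = 1.0
test[:, 2] = 1.0

train_s, test_s = standardize(train, test)
print("train mean per column:", train_s.mean(axis=0).round(6))
print("train std per column:", train_s.std(axis=0).round(6))
print("test mean per column:", test_s.mean(axis=0).round(3))
```

The test columns will not have exactly zero mean, which is expected: they are measured on the training set's scale.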
Conclusion
In this implementation, we used the Firefly Algorithm to train a neural network on the “Stroke Prediction” dataset. We preprocessed and balanced the data, split it into training and testing sets, defined the Firefly Algorithm class, and evolved the population according to the fitness of each firefly. We then printed the confusion matrices for the training and testing data and plotted the accuracy against the log of the number of weight updates for both sets.
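The question asks for the average accuracy, which suggests repeating the run with different random seeds and averaging the accuracy curves before plotting them. A minimal sketch of the averaging step follows. The accuracy values are placeholders made up for illustration, not results, and counting one weight update per firefly per generation is an assumption.

```python
import numpy as np

def average_curves(runs):
    """Average per-generation accuracy histories from several runs of equal length."""
    runs = np.asarray(runs, dtype=float)
    return runs.mean(axis=0), runs.std(axis=0)

# Placeholder accuracy histories from three hypothetical runs (not real results)
runs = [
    [0.50, 0.60, 0.70, 0.72],
    [0.52, 0.58, 0.66, 0.74],
    [0.48, 0.62, 0.68, 0.70],
]
updates_per_generation = 50  # assumed: one update per firefly per generation

mean_acc, std_acc = average_curves(runs)
log_updates = np.log(updates_per_generation * np.arange(1, len(mean_acc) + 1))

for x, m, s in zip(log_updates, mean_acc, std_acc):
    print(f"log(updates)={x:.3f}  mean accuracy={m:.3f}  std={s:.3f}")
```

The `log_updates` and `mean_acc` arrays can then be passed to `plt.plot` in place of the single-run curves.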
FAQ
Q: What is the Firefly Algorithm?
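A: The Firefly Algorithm is a metaheuristic optimization algorithm inspired by the flashing behavior of fireflies. Each candidate solution is a firefly whose brightness reflects its fitness. Dimmer fireflies move toward brighter ones, with attraction fading over distance, so the swarm searches for a good, though not guaranteed optimal, solution to the optimization problem.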
Q: What is the “Stroke Prediction” dataset?
A: The “Stroke Prediction” dataset is a dataset that contains information about patients and whether or not they have had a stroke. It includes demographic information, medical history, and lifestyle information.
Q: What is preprocessing?
A: Preprocessing is the process of cleaning and preparing the raw data for analysis. It can include tasks such as removing missing values, scaling features, and encoding categorical variables.
Q: What is data balancing?
A: Data balancing is the process of adjusting the class distribution in a dataset to avoid bias in the results of a classification algorithm. This is typically done by oversampling the minority class or undersampling the majority class.
Q: What is a confusion matrix?
A: A confusion matrix is a table used to evaluate the performance of a classification algorithm. It shows the number of true positives, true negatives, false positives, and false negatives for each class.
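To make the confusion matrix answer concrete, here is a small sketch that builds a 2x2 matrix by hand and derives accuracy, sensitivity and specificity from it. The labels are toy values, and the layout `[[TN, FP], [FN, TP]]` (rows are true classes, columns are predicted classes) matches the one returned by scikit-learn's `confusion_matrix` for 0/1 labels.

```python
import numpy as np

def binary_confusion(y_true, y_pred):
    """Return a 2x2 matrix laid out as [[TN, FP], [FN, TP]] for 0/1 labels."""
    cm = np.zeros((2, 2), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Toy labels: 1 = stroke, 0 = no stroke
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

cm = binary_confusion(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

acc = (tp + tn) / cm.sum()       # share of all predictions that are correct
sensitivity = tp / (tp + fn)     # share of actual strokes that were detected
specificity = tn / (tn + fp)     # share of non-strokes correctly cleared

print(cm)
print(acc, sensitivity, specificity)
```

For a medical task such as stroke prediction, sensitivity is often worth reporting alongside accuracy, because a missed stroke (a false negative) is usually the costlier error.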