
Mastering Machine Learning: Logistic Regression to Ensemble Methods

5 min read · Aug 31, 2024

Introduction

Machine learning is an exciting journey in which every stage builds on the previous one. In this article, I’ll guide you through five machine learning tasks, from logistic regression to ensemble methods such as random forests and gradient boosting. Along the way, we will discuss model assessment methods, cross-validation, overfitting and underfitting, and performance measures such as precision, recall, F1 score, and ROC-AUC. Each task uses a popular dataset, giving a step-by-step guide to these concepts.

Predicting Diabetes Onset Using Logistic Regression

In this task, I learned how to use logistic regression for binary classification with the Diabetes Dataset. The process included data cleaning and preparation, such as handling missing values and encoding categorical variables. By training a logistic regression model and assessing its performance with accuracy, precision, and recall, I saw how logistic regression can be used to predict health conditions. This task also underscored the importance of data preprocessing and of using multiple performance measures to judge a model’s effectiveness.
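The exact preparation steps depend on which copy of the dataset you use. As a minimal sketch, assuming the Pima Indians Diabetes CSV (the file path and column names below are placeholders):

# Preprocessing sketch: file path and column names are placeholders
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('diabetes.csv')  # hypothetical path to the dataset

# In the Pima version of this dataset, zeros in some columns stand in
# for missing measurements; replace them with the column median
for col in ['Glucose', 'BloodPressure', 'BMI']:
    df[col] = df[col].replace(0, df[col].median())

X = df.drop('Outcome', axis=1)
y = df['Outcome']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)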

Model Training: A logistic regression model is instantiated and trained using the training data (X_train and y_train).

# Model training
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)

Model Evaluation: The model’s performance is evaluated using accuracy, precision, and recall. These metrics provide insight into how well the model is predicting diabetes onset.

# Evaluating the model
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
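Since the introduction also mentions the F1 score, it is worth noting that it is a one-line addition here (a sketch, not part of the original task):

# Optional: F1 score, the harmonic mean of precision and recall
from sklearn.metrics import f1_score

f1 = f1_score(y_test, y_pred)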

Classifying Iris Species Using Decision Trees

In this project, a decision tree classifier was used to classify species in the Iris Dataset, and the experience offered useful insight into how decision trees work. I learned how to prepare data properly, including handling missing values and normalizing features. Evaluating the model with a confusion matrix and an accuracy score showed me that decision trees are both effective and easy to explain. This task highlighted the ability of decision trees to make transparent classification choices and the importance of evaluating model performance with appropriate metrics.
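As a rough sketch of the preparation steps (the scikit-learn copy of Iris has no missing values, and tree models do not strictly require scaling, but normalizing keeps the pipeline consistent with other models):

# Data preparation sketch for the Iris dataset
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

# Fit the scaler on the training data only to avoid leakage
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)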

Model Training: A decision tree classifier is instantiated with a maximum depth of 3 to prevent overfitting. The model is then trained using the training data.

# Model training
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(max_depth=3)
model.fit(X_train, y_train)
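One reason decision trees are easy to explain is that the fitted splits can be drawn directly. A minimal sketch using scikit-learn’s plot_tree (it assumes the iris object from the preparation step above):

# Visualizing the fitted tree (optional)
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plot_tree(model, filled=True, feature_names=iris.feature_names,
          class_names=iris.target_names)
plt.show()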

Model Evaluation: The confusion matrix and accuracy score are calculated to assess the model’s performance. The confusion matrix provides detailed insights into true positives, true negatives, false positives, and false negatives.

# Evaluating the model
from sklearn.metrics import confusion_matrix, accuracy_score

y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
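For a quick visual check of those four cell counts, the matrix can also be rendered as a heatmap (an optional sketch using scikit-learn’s ConfusionMatrixDisplay):

# Rendering the confusion matrix as a heatmap (optional)
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

ConfusionMatrixDisplay(confusion_matrix=cm).plot()
plt.show()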

Predicting Titanic Survival Using Logistic Regression

This task involved predicting Titanic survival with logistic regression, and here I did more advanced data preprocessing, such as imputing missing values with the median and encoding categorical variables. Applying ROC-AUC as a binary classification metric helped me understand the finer points of the model’s performance. Plotting the ROC curve and calculating the AUC introduced an evaluation method that focuses on the model’s ability to rank instances, a reminder that thorough model evaluation goes beyond accuracy alone.
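A minimal preprocessing sketch, assuming the standard Kaggle Titanic CSV (the file path, the chosen feature subset, and the column names are illustrative):

# Preprocessing sketch for the Titanic dataset
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('titanic.csv')  # hypothetical path

# Impute missing ages with the median, then one-hot encode categoricals
df['Age'] = df['Age'].fillna(df['Age'].median())
df = pd.get_dummies(df, columns=['Sex', 'Embarked'], drop_first=True)

features = ['Pclass', 'Age', 'Fare', 'Sex_male']  # illustrative subset
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df['Survived'], test_size=0.2, random_state=42)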

Model Training and Prediction: A logistic regression model is trained on the Titanic dataset. The predict_proba method returns the probabilities for each class, and we select the probability of the positive class (survival).

# Model training and prediction
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred_prob = model.predict_proba(X_test)[:, 1]

ROC-AUC Calculation: The ROC-AUC score is calculated to evaluate the model’s ability to distinguish between the classes. The false positive rate (fpr) and true positive rate (tpr) are computed for plotting the ROC curve.

# ROC-AUC evaluation
from sklearn.metrics import roc_auc_score, roc_curve

roc_auc = roc_auc_score(y_test, y_pred_prob)
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

ROC Curve Plotting: The ROC curve is plotted, showing the trade-off between sensitivity (the true positive rate) and specificity (1 minus the false positive rate). The area under the curve (AUC) provides a single metric to evaluate the model’s performance.

# Plotting the ROC curve
import matplotlib.pyplot as plt

plt.plot(fpr, tpr, label=f'ROC curve (area = {roc_auc:.2f})')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()

Applying Cross-Validation to Random Forest Classifier

The most valuable lesson from using a random forest classifier with k-fold cross-validation was learning about model stability: a stable model is less sensitive to small changes in the data. The cross-validation scores showed me how to assess a model’s performance more reliably than with a single train/test split. This task also highlighted the role of ensemble methods and cross-validation in reducing overfitting, and applying these techniques in practice taught me how to check that a model generalizes to unseen data.

Model Training with Cross-Validation: A random forest classifier with 100 trees (n_estimators=100) is trained using 5-fold cross-validation. This means the dataset is split into 5 parts, and the model is trained and tested 5 times, each time using a different fold as the test set.

# Model training with cross-validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

model = RandomForestClassifier(n_estimators=100, random_state=42)
cv_scores = cross_val_score(model, X, y, cv=5)

Cross-Validation Evaluation: The cross-validation scores for each fold are printed, and the mean score is calculated to assess the model’s overall performance and stability.

print('Cross-Validation Scores:', cv_scores)
print('Mean CV Score:', cv_scores.mean())
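As a small extension beyond the original snippet, the spread of the fold scores gives a rough stability check: a low standard deviation means the model performs consistently across folds.

# Spread of the fold scores as a rough stability check (sketch)
print('CV Score Std:', cv_scores.std())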

Investigating Overfitting and Underfitting in Gradient Boosting Machines

In this task, I explored the details of gradient boosting machines and how parameters such as the number of estimators and the learning rate influence the model. Diagnosing overfitting and underfitting by comparing training and validation accuracy helped me understand how to control model complexity. Through this hands-on session with gradient boosting, I gained an appreciation of the need to tune model parameters and to do so through experimentation.

Model Training with Varying Parameters: The model is trained with different numbers of estimators (n_estimators = 50, 100, 150) to observe how the model's performance changes. For each setting, the model's accuracy on the training set (train_scores) and validation set (valid_scores) is recorded.

# Model training with varying parameters
from sklearn.ensemble import GradientBoostingClassifier

train_scores = []
valid_scores = []

for n_estimators in [50, 100, 150]:
    model = GradientBoostingClassifier(n_estimators=n_estimators,
                                       learning_rate=0.1, random_state=42)
    model.fit(X_train, y_train)
    train_scores.append(model.score(X_train, y_train))
    valid_scores.append(model.score(X_test, y_test))

Plotting Results: The training and validation accuracies are plotted against the number of estimators. This plot helps visualize whether the model is overfitting (high training accuracy, low validation accuracy) or underfitting (both accuracies low).

plt.plot([50, 100, 150], train_scores, label='Training Accuracy')
plt.plot([50, 100, 150], valid_scores, label='Validation Accuracy')
plt.xlabel('Number of Estimators')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
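The write-up above also mentions the learning rate, so the same experiment can be repeated for it. A minimal sketch at a fixed number of estimators (the values 0.01, 0.1, and 0.5 are illustrative):

# Sketch: sweeping the learning rate at a fixed number of estimators
for lr in [0.01, 0.1, 0.5]:
    model = GradientBoostingClassifier(n_estimators=100,
                                       learning_rate=lr, random_state=42)
    model.fit(X_train, y_train)
    print(f'learning_rate={lr}: '
          f'train={model.score(X_train, y_train):.3f}, '
          f'valid={model.score(X_test, y_test):.3f}')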

Conclusion

These tasks offered a detailed look at many facets of machine learning, including logistic regression, decision trees, random forests, and gradient boosting. Through understanding cross-validation, overfitting and underfitting, and key performance metrics, I learned how to build, evaluate, and fine-tune models. The journey is progressive: each step prepares me for the next, more challenging problem.
