Machine Learning for Beginners; Project 1: Predicting Titanic Survivors

In this post, I am going to teach you, step by step, how to build one of the simplest Machine Learning projects. I will begin by explaining just the minimum theory necessary to structure our project (all the code will be in this article, do not worry), and then I will code it from scratch. If you want an easy overview of how an AI program works, you can quickly read the no-code explanation section of this article. I created a repository where I save the code for every single pythonkai project.

This is my first project: should I start here?

When you start studying machine learning, you will probably need to start with Supervised Learning on tabular data, because all the other options, like NLP and computer vision, require you to already have a good grasp of coding and of the main Python libraries for data science. You can follow our guides on how to study data science if you need direction. We also have guides on the right steps to learn Python.

Therefore, when you start with Supervised Machine Learning, you can only work on two kinds of problems: regression or classification. The difference between the two is very simple. In both cases, you have a set of columns that act as predictors (features) and one or more columns that you wish to predict (labels).

  • In classification, labels are categorical
  • In regression, labels are numerical

Classification is slightly easier than regression because you can measure the accuracy of a model as a simple percentage. For regression problems, instead, you need different metrics, like MSE, that are harder to interpret.
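If you want to see the difference between the two metrics in practice, here is a minimal sketch with sklearn (the numbers are made up for illustration):

from sklearn.metrics import accuracy_score, mean_squared_error

#classification: labels are categories, so we can count the share of correct guesses
y_true_class = [1, 0, 1, 1]
y_pred_class = [1, 0, 0, 1]
print(accuracy_score(y_true_class, y_pred_class))  #0.75, easy to read as 75%

#regression: labels are numbers, so we measure how far off the predictions are
y_true_reg = [10.0, 20.0, 30.0]
y_pred_reg = [12.0, 18.0, 33.0]
print(mean_squared_error(y_true_reg, y_pred_reg))  #about 5.67, no natural percentage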

Structuring the program

In short, a supervised AI program is structured in this way: we start from tabular data, in our case the Titanic dataset. We divide the columns into two kinds: the columns we wish to predict (labels) and the ones that will act as predictors (features). Once they are split, we can train the model.

In the Titanic dataset, I wish to predict the survivors (labels), and I want to get rid of the unnecessary columns. There is a huge number of possible analyses you can make when selecting the data you need for training; this selection process is called feature engineering. However, because it usually takes years to master all the possible techniques, I will only make simple assumptions and keep the data that I think is relevant for our prediction.
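Before committing to a set of columns, it helps to take a quick look at the data. Here is a minimal exploration sketch (it assumes the same .csv file we will load in step 1):

import pandas as pd

df = pd.read_csv('titanic_dataset.csv')

#list every available column
print(df.columns)

#count the missing values per column, useful for spotting columns to drop
print(df.isna().sum())

These are the steps we will follow: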

  1. Importing the dataset
  2. Preprocessing
  3. Feature and label selection
  4. Train and test split
  5. Train the model
  6. Evaluate the model

1. Importing the dataset

Our first step will be to load the dataset. The gold standard for working on tabular data is a library called pandas. It may take you some time to get used to its syntax. After downloading the .csv file here, I will load it using a function called read_csv.

import pandas as pd

#load the dataset and drop the rows with missing values
df = pd.read_csv('titanic_dataset.csv').dropna()
Titanic dataset sample
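If you want to reproduce a quick preview of what we just loaded (like the sample above), pandas makes it easy:

#number of rows and columns left after the dropna call
print(df.shape)

#first five rows of the dataset
df.head()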

2. Preprocessing

There are several features I do not wish to select because they are not going to contribute to the predictive power of my model; actually, they might even confuse the model. I am only going to keep the features that I think will act as good predictors.

#keep only the label and the features we will use as predictors
df = df[['Survived', 'Pclass', 'Sex', 'Age', 'Parch']]
Dataset after preprocessing

3. Feature and label selection

As mentioned before, I will need to divide the dataset into features and labels by using the following code. Because one of the features (Sex) is in the form of categorical data, I will need to encode it. I can easily do that with a function called get_dummies, which will perform a one-hot encoding on the "Sex" column.

#labels: the column we want to predict
y = df.pop('Survived')
display(y)

#features: every remaining column, with the categorical "Sex" column one-hot encoded
X = df
X = pd.get_dummies(X)
display(X)
features for the training dataset
labels for the training dataset
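A side note: on this dataset, get_dummies replaces the Sex column with one indicator column per category (Sex_female and Sex_male). If you prefer a single binary column, pandas supports a drop_first parameter; here is an alternative to the call above:

#keep only one indicator per categorical column (e.g. Sex_male)
X = pd.get_dummies(X, drop_first=True)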

4. Train and test split

I will train my model on the train set, and then I will use the test set to see how well it performs.

from sklearn.model_selection import train_test_split

#hold out 28% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.28, random_state=0)
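One optional refinement: train_test_split also accepts a stratify argument, which keeps the proportion of survivors similar in both sets and tends to make the score more stable. A sketch of the same split with stratification:

#same split, but preserving the class balance of y in both sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.28, random_state=0, stratify=y)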

5. Train the model

It is finally time to train the model. The easiest and most complete library for building Machine Learning models is called sklearn. I will use one of the many models available for classification, called a Support Vector Machine.

from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

#scale the features, then fit a Support Vector Machine classifier
clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
clf.fit(X_train, y_train)

6. Evaluate the model

Let us now see how the model performs on data it has never seen before. This step is very important: if you test the model on the train data, the model will recognize it and the results will be biased.

clf.score(X_test, y_test)

0.8076923076923077
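Accuracy is a single number; if you want more detail on how the model behaves on each class, sklearn also provides a classification report. A sketch, assuming the fitted pipeline above:

from sklearn.metrics import classification_report

#precision, recall, and F1 score for each class (survived / did not survive)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))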

We reached an accuracy of 80%! After some tuning, I found that this dataset can reach an accuracy between 75% and 82%, depending on the proportions and the contents of the train and test sets.
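A common way to account for this variability is cross-validation, which repeats the train/test split several times and averages the scores. A minimal sketch with sklearn:

from sklearn.model_selection import cross_val_score

#train and score the pipeline on 5 different splits of the data
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean(), scores.std())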

Is 80% accuracy enough?

I know that you would like to reach 99% immediately, but trust me, this is not how it works. The point is not to create a killer model, but to have relevant data that you can use. The better the data, the better the model. Sometimes, if the data is not good enough, you may only reach 30% accuracy.

Join our free programming community on Discord, learn how to code, and meet other experts.

What is next? -> Creating a Linear Regression Model
