In this post, I am going to teach you, step by step, how to build one of the simplest Machine Learning projects. I will begin by explaining just the minimum theory needed to structure the project (all the code is in this article, do not worry), and then I will code it from scratch. If you only want a quick idea of how an AI program works, you can read the no-code explanation section of this article. I also created a repository where I save the code for every single pythonkai project.
This is my first project: should I start from here?
When you start studying machine learning, you will probably want to begin with Supervised Learning on tabular data, because the other options, like NLP and computer vision, require you to already have a good knowledge of coding and of the main Python libraries for data science. You can follow our guides on how to study data science if you need direction. We also have guides on the right steps to learn Python.
When you start with Supervised Machine Learning, you will work on two kinds of problems: Regression or Classification. The difference between the two is very simple. In both cases, you have a set of columns that act as predictors (features), and one or more columns that you wish to predict (labels).
- In classification, labels are categorical
- In regression, labels are numerical
Classification is slightly easier to evaluate than regression because you can measure the accuracy of a model as a percentage of correct predictions. For regression problems, you need different metrics, like MSE (mean squared error), that are harder to interpret.
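To make the difference concrete, here is a minimal sketch using made-up labels and predictions (the numbers are purely illustrative and do not come from the Titanic data). It relies on accuracy_score and mean_squared_error from sklearn.metrics.

```python
from sklearn.metrics import accuracy_score, mean_squared_error

# Classification: labels are categories (here 0 = did not survive, 1 = survived).
y_true_cls = [1, 0, 1, 1, 0]
y_pred_cls = [1, 0, 0, 1, 0]
accuracy = accuracy_score(y_true_cls, y_pred_cls)  # share of correct predictions

# Regression: labels are numbers (here, invented ticket prices).
y_true_reg = [10.0, 20.0, 30.0]
y_pred_reg = [12.0, 18.0, 33.0]
mse = mean_squared_error(y_true_reg, y_pred_reg)  # mean of the squared errors

print(f"accuracy: {accuracy:.2f}")
print(f"MSE: {mse:.2f}")
```

The accuracy reads directly as "4 out of 5 correct", while the MSE is expressed in squared units of the label, which is why it takes more effort to interpret.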
Structuring the program
In a few sentences, a supervised AI program is structured in this way: we start from tabular data, in our case, the Titanic Dataset. We divide the columns into two kinds: the columns we wish to predict (labels) and the ones that will act as predictors (features). Once they are split, we can train the model.
In the Titanic dataset, I wish to predict the survivors (labels), and I want to get rid of the unnecessary columns. There is a huge number of possible analyses you can make when selecting the data you need for training. This selection is part of what is called feature engineering. However, because it usually takes years to master all the possible techniques, I will only make simple assumptions and keep the data that I think is relevant for our prediction. The program follows these steps:
- Importing the dataset
- Feature and label selection
- Train and test split
- Train the model
- Evaluate the model
1. Importing the dataset
Our first step will be to load the dataset. The gold standard for working on tabular data is a library called pandas. It may take you some time to get used to the syntax. After downloading the .csv file here, I will load it using a function called read_csv, and I will drop the rows that contain missing values with dropna.
import pandas as pd

# load the CSV file and drop every row that contains a missing value
df = pd.read_csv('titanic_dataset.csv').dropna()
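Before dropping rows, it can help to see how many values are missing. The sketch below uses a tiny hypothetical table (the name toy and its values are invented for illustration, they are not rows from the real file) to show what isna and dropna do.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-table standing in for the real CSV file.
toy = pd.DataFrame({
    "Survived": [1, 0, 1, 0],
    "Age": [22.0, np.nan, 35.0, 54.0],
    "Cabin": ["C85", "E46", None, "B20"],
})

missing_per_column = toy.isna().sum()  # number of missing values in each column
clean = toy.dropna()                   # keep only the rows with no missing value

print(missing_per_column)
print(f"rows before: {len(toy)}, rows after: {len(clean)}")
```

Note that dropna removes a row as soon as any column is empty, so a column with many gaps can shrink the dataset a lot. Running the same two checks on df is a quick way to see how many rows survive.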
There are several features I do not wish to select because they are not going to contribute to the predictive result of my model. Actually, they might even confuse the model. I am only going to select some of the features that I think are going to act as good predictors.
#get rid of useless features
df = df[['Survived', 'Pclass', 'Sex', 'Age', 'Parch']]
3. Feature and label selection
As mentioned before, I will need to divide the dataset into features and labels by using the following code. Because some of the features (Sex) are in the form of categorical data, I will need to encode them. I can easily do that with a function called get_dummies, which will perform a one-hot encoding on the "Sex" column.
# display() is available in Jupyter notebooks; use print() in a plain script

#labels
y = df.pop('Survived')
display(y)

#features
X = df
X = pd.get_dummies(X)
display(X)
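If one-hot encoding is new to you, this small sketch shows what get_dummies produces. The three rows are hypothetical and only mimic the shape of the selected Titanic columns.

```python
import pandas as pd

# Hypothetical rows that mimic the selected Titanic columns.
toy = pd.DataFrame({
    "Pclass": [1, 3, 2],
    "Sex": ["female", "male", "female"],
    "Age": [38.0, 22.0, 27.0],
})

# Only the text column "Sex" is expanded; numeric columns are left untouched.
encoded = pd.get_dummies(toy)

print(list(encoded.columns))
print(encoded)
```

Each category of "Sex" becomes its own column (Sex_female and Sex_male), holding a flag that says whether the row belongs to that category. Depending on your pandas version, the flags are shown as 0/1 or as True/False.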
4. Train and test split
I will train my model on the train set, and I will use the test set to see how well it performs. With test_size=0.28, 28% of the rows are held out for testing, and random_state=0 makes the split reproducible.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.28, random_state=0)
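Here is a minimal sketch of what the split does, run on made-up data (the arrays X_toy and y_toy are random numbers I generate only for illustration). It checks that no row is lost and that fixing random_state gives the same split every time.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 3 feature columns, binary labels.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.28, random_state=0
)
print("train rows:", len(X_tr), "test rows:", len(X_te))

# Splitting again with the same random_state returns the same rows.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.28, random_state=0
)
print("same split:", np.array_equal(X_te, X_te2))
```

If you change random_state, different rows end up in the test set, which is one reason the final accuracy can move by a few points between runs.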
5. Train the model
It is finally time to train the model. The easiest and most complete library for making Machine Learning models is called sklearn. I will use one of its many classification models, called a Support Vector Machine. Because this model is sensitive to the scale of the features, I put a StandardScaler in front of it with make_pipeline.
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# scale the features, then fit a Support Vector Classifier
clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
clf.fit(X_train, y_train)
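To see what the pipeline contains after training, here is a small sketch on synthetic data created with make_classification (a stand-in I chose for illustration, not the Titanic dataset). The variable names model and scaler are mine.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the Titanic dataset).
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=0)

model = make_pipeline(StandardScaler(), SVC(gamma="auto"))
model.fit(X_toy, y_toy)

# make_pipeline names each step after its class, in lowercase.
print(list(model.named_steps))

# After fitting, the scaler has learned one mean per feature column.
scaler = model.named_steps["standardscaler"]
print(scaler.mean_.shape)

# predict() applies the scaling first, then the classifier.
predictions = model.predict(X_toy[:5])
print(predictions)
```

The advantage of a pipeline is that the scaler is fitted only on the data passed to fit, and the same scaling is then reused automatically whenever you call predict or score.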
6. Evaluate the model
Let us now see how the model performs on some data it has never seen before. This step is very important because if you test the model on the train data, the model has already seen those rows and the result will be too optimistic.
clf.score(X_test, y_test)
0.8076923076923077
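For a classifier, score returns the accuracy, that is, the share of test rows predicted correctly. The sketch below checks this by hand on synthetic data (again a stand-in, not the Titanic dataset) and also builds a confusion matrix, which is an extra I add here to show where the mistakes fall.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data, split the same way as in the article.
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.28, random_state=0
)

model = make_pipeline(StandardScaler(), SVC(gamma="auto"))
model.fit(X_tr, y_tr)
y_pred = model.predict(X_te)

score = model.score(X_te, y_te)     # accuracy computed by sklearn
manual = (y_pred == y_te).mean()    # same thing: correct predictions / total

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_te, y_pred)

print("score: ", score)
print("manual:", manual)
print(cm)
```

The diagonal of the confusion matrix counts the correct predictions, so dividing its sum by the total number of test rows gives the same accuracy.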
We reached an accuracy of roughly 80%! After some experimenting, I found that on this dataset the accuracy lands between 75% and 82%, depending on the size of the test set and on which rows end up in the test and train sets.
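You can observe this kind of variation yourself by repeating the split with different random_state values. This is a minimal sketch on synthetic data (not the Titanic dataset), so the exact numbers will differ from the ones above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the Titanic dataset).
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=0)

scores = []
for seed in range(10):
    # A different seed puts different rows in the train and test sets.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_toy, y_toy, test_size=0.28, random_state=seed
    )
    model = make_pipeline(StandardScaler(), SVC(gamma="auto"))
    model.fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))

print(f"lowest accuracy:  {min(scores):.3f}")
print(f"highest accuracy: {max(scores):.3f}")
```

Looking at the range instead of a single number gives a more honest picture of how the model performs, especially on a small dataset.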
Is 80% accuracy enough?
I know that you would like to reach 99% immediately, but trust me, this is not how it works. The point is not to create a killer model, but to have relevant data that you can use. The better the data, the better the model. Sometimes, if the data is not good enough, you may only reach 30% accuracy.
What is next? -> Creating a Linear Regression Model