As you have probably seen from my previous Machine Learning projects for beginners, the learning path I have outlined for you starts with a heavy focus on classification rather than regression. This is because classification is a much simpler problem to code than regression (it needs much less preprocessing, for example), and the metrics that best show how well a model is performing can be expressed directly as percentages. The full code is available in my repo.
Especially if you are a beginner (it would not make much sense to follow these projects otherwise), you should study Machine Learning gradually, rather than venturing straight into daunting topics like neural networks or Computer Vision. Fear not, your time will come.
Naive Bayes Classifier
The Naive Bayes Classifier is probably the simplest Machine Learning algorithm in circulation after linear regression. It builds on Bayes' theorem, which dates back to the 18th century, long before computers existed, so, as you can imagine, it is simple enough to be worked out by hand. Nevertheless, this algorithm teaches you a great deal about classification, and it is probably one of the most useful algorithms for simple Machine Learning use cases.
The structure behind this algorithm is quite simple: unlike most other ML algorithms, it does not look at each individual sample in the dataset when making a prediction. Instead, it only needs the mean, standard deviation, and alpha of each feature within each class. Most of the data is lost in the process because it is never considered directly. However, if all features are normally distributed, this algorithm can be extremely accurate.
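To make the idea concrete, here is a minimal sketch of a Gaussian Naive Bayes classifier written with the Python standard library only. It is not the sklearn implementation: it assumes every feature is normally distributed within each class, it uses only the class prior, mean, and sample standard deviation (no alpha), and the measurements and class names are made up for illustration.

```python
import math
from statistics import mean, stdev

def fit_gaussian_nb(samples, labels):
    """Summarize each class by its prior and per-feature (mean, std)."""
    model = {}
    for cls in sorted(set(labels)):
        rows = [x for x, y in zip(samples, labels) if y == cls]
        columns = list(zip(*rows))  # one tuple of values per feature
        model[cls] = {
            "prior": len(rows) / len(samples),
            "stats": [(mean(col), stdev(col)) for col in columns],
        }
    return model

def log_gaussian(x, mu, sigma):
    """Log of the normal probability density at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def predict(model, x):
    """Return the class with the highest log posterior for one sample."""
    scores = {
        cls: math.log(p["prior"])
        + sum(log_gaussian(v, mu, sd) for v, (mu, sd) in zip(x, p["stats"]))
        for cls, p in model.items()
    }
    return max(scores, key=scores.get)

# Made-up (width, height) measurements for two hypothetical pepper types.
samples = [(1.0, 5.0), (1.2, 5.5), (0.9, 4.8), (3.0, 9.0), (3.2, 9.5), (2.9, 8.8)]
labels = ["small", "small", "small", "large", "large", "large"]

model = fit_gaussian_nb(samples, labels)
print(predict(model, (1.1, 5.1)))
print(predict(model, (3.1, 9.2)))
```

Notice that once the model is fitted, the individual samples are never used again: a prediction only needs the handful of numbers stored for each class.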
The main issue is that in real life most datasets do not follow such precise distributions, and they require more advanced models to reach a higher level of accuracy.
Does Naive Bayes classifier overuse pure statistics?
Most importantly, this algorithm forces you to understand the importance of distributions in data science, because it is not computationally heavy at all. This time, your PC will not save you. Statistics offers you a huge pool of probability distributions to pick from when modeling your data. Knowing the best way to model your data, for example whether it follows a lognormal or a normal distribution, really makes a difference when building models.
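As a rough illustration of telling a lognormal feature from a normal one, the sketch below compares the skewness of a synthetic feature before and after taking its logarithm. The data is generated on the spot (it is not the pepper dataset), and skewness is only one simple heuristic among many: a value close to zero after the log transform suggests the raw data is closer to lognormal than to normal.

```python
import math
import random
from statistics import mean, pstdev

def skewness(values):
    """Population skewness: close to 0 for symmetric data such as a normal sample."""
    mu, sd = mean(values), pstdev(values)
    return sum(((v - mu) / sd) ** 3 for v in values) / len(values)

random.seed(0)
# Synthetic, right-skewed feature drawn from a lognormal distribution.
raw = [random.lognormvariate(0.0, 0.5) for _ in range(2000)]
logged = [math.log(v) for v in raw]

skew_raw = skewness(raw)
skew_log = skewness(logged)
print(f"skewness of raw values: {skew_raw:.2f}")
print(f"skewness of log values: {skew_log:.2f}")
```

If the log-transformed values look symmetric, feeding the logged feature to a Gaussian model is usually a better fit than feeding the raw one.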
Think about it this way: when your data volume is big enough, you cannot afford to build a model by simply pressing refresh. Each attempt will cost you a great deal of money (training GPT-3, for example, is estimated to have cost around 12 million USD). The only way to gain knowledge about your data is to use statistics and mathematics: from there, you can get the “instructions” on how to tune the hyperparameters of a Machine Learning model.
Structuring the algorithm
The first step in building any algorithm, after having clearly understood the theory, is to outline the necessary steps. In the case of our Naive Bayes classifier, these are the steps we are going to follow:
- Importing the dataset
- Exploratory Data Analysis (EDA)
- Feature and label selection
- Train and test split
- Train the model
Compared with the previous projects, I have cut the prediction step from the sections. While you were still a complete beginner, it made sense for me to show the potential of these algorithms; now that you know a prediction method is available in all sklearn machine learning models, there is no need to show how it works here. You can simply call the predict method described in the sklearn documentation of the Naive Bayes Classifier.
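For reference only, this is roughly what that prediction call looks like. The sketch uses a tiny made-up training set instead of the pepper dataset, and the names X_demo, y_demo, and demo_clf are placeholders of mine.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Tiny made-up training set: two features, two classes.
X_demo = np.array([[1.0, 5.0], [1.2, 5.5], [0.9, 4.8],
                   [3.0, 9.0], [3.2, 9.5], [2.9, 8.8]])
y_demo = np.array(["small", "small", "small", "large", "large", "large"])

demo_clf = GaussianNB().fit(X_demo, y_demo)

new_samples = np.array([[1.1, 5.1], [3.1, 9.2]])
predicted = demo_clf.predict(new_samples)            # one label per row
probabilities = demo_clf.predict_proba(new_samples)  # one column per class, ordered as demo_clf.classes_
print(predicted)
print(demo_clf.classes_)
```

predict returns the most likely label for each row, while predict_proba returns the estimated probability of every class.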
1. Importing the dataset
The dataset I am going to use in this project (available at this link) was generated from a collection of spicy pepper measurements using probability distribution functions. If you are interested in how this dataset was created, this is the direct link to my old article on the matter.
import pandas as pd

# load the pepper dataset from the CSV file
df = pd.read_csv('spicy_pepper_dataset.csv')
df
2. Exploratory Data Analysis
The dataset has already been edited, so we do not need to preprocess it more than necessary. Before even touching a dataset, Data Scientists spend time understanding which data they have at their disposal and whether they need to clean it or drop empty values. This process is so important that it may take up to 80% of a data scientist’s time.
So far, you have been used to training a model by pressing a single button (and you will enjoy this privilege until the end of the beginner course). When you have gigabytes to analyze, it is really important to understand which data you have before even touching it, because a simple action like cleaning a 3 GB dataset can take your PC hours. If you have an urgent problem and your time is scarce, it is better to do your homework before starting to work.
Although this dataset is very small and simple, let us explore it, so that you can understand why it is perfect for a very simple classification algorithm like the Naive Bayes Classifier.
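Before plotting anything, a few quick checks usually tell you what you are dealing with. The sketch below runs them on a small hypothetical stand-in table: the column names and values are assumptions of mine, and the real CSV may differ.

```python
import pandas as pd

# Hypothetical stand-in for the pepper dataset; the real CSV may use other column names.
demo_df = pd.DataFrame({
    "height": [5.0, 5.5, None, 9.0, 9.5, 8.8],
    "width": [1.0, 1.2, 0.9, 3.0, 3.2, 2.9],
    "color": ["red", "green", "red", "yellow", "red", "yellow"],
    "name": ["A", "A", "A", "B", "B", "B"],
})

print(demo_df.dtypes)                   # which columns are numeric and which are categorical
missing = demo_df.isna().sum()          # count of empty values per column
print(missing)
print(demo_df["name"].value_counts())   # class balance
cleaned = demo_df.dropna()              # drop rows that contain empty values
print(demo_df.describe())               # summary statistics of the numeric columns
```

On a dataset that has already been cleaned, the count of empty values should be zero for every column, and dropna leaves the table unchanged.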
Because this dataset was cleaned before being uploaded to the project repository, there is little preprocessing to do. However, I can still notice that one of my features, color, is in the form of categorical data. I will need to encode it into numerical data so that we can feed it to the model. I can accomplish this with a single line of code, using one-hot encoding.
I will use the one-hot encoding function after having separated the features and labels. Is this a common practice, to preprocess after having split features and labels? Absolutely not, but to save you several lines of code, this is the best line of action.
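If you have never seen one-hot encoding in action, here is a small sketch on a made-up table (the values are invented for illustration). pandas' get_dummies replaces each categorical column with one indicator column per category and leaves numeric columns untouched.

```python
import pandas as pd

# Made-up feature table with one categorical column.
features = pd.DataFrame({
    "width": [1.0, 3.0, 1.2],
    "color": ["red", "yellow", "red"],
})

encoded = pd.get_dummies(features)  # only non-numeric columns are expanded
print(list(encoded.columns))
```

Each new column is named after the original column and the category it represents, so the color column becomes color_red and color_yellow.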
import seaborn
import matplotlib.pyplot as plt

# pairplot of the numerical features, colored by pepper name
seaborn.pairplot(df, hue='name')
# show the figure
plt.show()
The dataset only has two numerical features, width and height. As we can see from the graph, the normal distributions of the features barely overlap (when considered together). In fact, when every sample is plotted on a Cartesian plane, most of the classes do not occupy the same region. This already shows how well suited this data is for classification, because the model will rarely confuse one class with another.
Also, because all features are normally distributed, we can feed them to the Naive Bayes classifier without fear of losing important information in the process (the algorithm considers nothing other than the summary statistics of each distribution, but all the important information about each feature can be represented by those statistics).
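Those summary statistics can be computed directly with a groupby. The sketch below uses synthetic data with two hypothetical classes, not the pepper dataset, and the column names are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in: two hypothetical classes with normally distributed features.
demo = pd.DataFrame({
    "name": ["A"] * 200 + ["B"] * 200,
    "width": np.concatenate([rng.normal(2.0, 0.3, 200), rng.normal(4.0, 0.3, 200)]),
    "height": np.concatenate([rng.normal(6.0, 0.5, 200), rng.normal(9.0, 0.5, 200)]),
})

# Per-class mean and standard deviation: together with the class frequencies,
# this is essentially all a Gaussian Naive Bayes model keeps about the data.
summary = demo.groupby("name")[["width", "height"]].agg(["mean", "std"])
print(summary.round(2))
```

When the class means are several standard deviations apart, as they are here, the distributions barely overlap and the classifier has an easy job.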
3. Feature and label selection
As usual, I will separate features and labels. Because one of the features contains the color of each Spicy Pepper, which is categorical data, I will convert it into numerical data after the split, just to save us the need to reattach the encoded columns to the original dataset.
# select the feature columns (names must match the CSV header)
X = df[['heigth', 'width', 'color']]
# one-hot encode the categorical color column
X = pd.get_dummies(X)
# labels as a 1-D Series, the shape sklearn expects
y = df['name']
4. Train and test split
As usual, to train the model in a supervised fashion, we need to split the features and labels into train and test sets. In this case, I arbitrarily choose an 80/20 split: 80% of the samples for training and 20% for testing.
from sklearn.model_selection import train_test_split

# hold out 20% of the samples for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
5. Train the model
To train the classifier, I will use the Gaussian Naive Bayes classifier from sklearn.
from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()
clf.fit(X_train, y_train)
# mean accuracy on the test set
clf.score(X_test, y_test)
# output: 0.9723333333333334
Because of the high quality of the data, our classifier has reached an accuracy of 97%! However, keep in mind that in your own first projects you should expect a much lower accuracy, let’s say around 60-70%, because the score of your model depends not only on the way you tune it, but also on the quality of your data.
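A single accuracy number hides which classes get confused with each other, so it can help to look at a confusion matrix and at what the model has actually learned. The sketch below does this on synthetic, well-separated data (not the pepper dataset); the variable names are placeholders of mine, and the stratify argument is an addition that keeps the class proportions equal in both sets.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(42)
# Synthetic two-class data with well-separated normal features (not the pepper dataset).
X_syn = np.vstack([rng.normal([2.0, 6.0], 0.4, size=(300, 2)),
                   rng.normal([4.0, 9.0], 0.4, size=(300, 2))])
y_syn = np.array(["A"] * 300 + ["B"] * 300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn)

model = GaussianNB().fit(X_tr, y_tr)
accuracy = model.score(X_te, y_te)
matrix = confusion_matrix(y_te, model.predict(X_te), labels=model.classes_)

print(f"accuracy: {accuracy:.3f}")
print(matrix)              # rows are true classes, columns are predicted classes
print(model.theta_)        # per-class feature means learned during fit
print(model.class_prior_)  # per-class prior probabilities
```

The off-diagonal cells of the matrix count the misclassified samples, which is the first place to look when the accuracy on your own data is lower than you hoped.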
Did you find this guide useful? If you wish to explore more code, you can check the list of projects that will progressively help you learn data science. If you wish to have a general idea of what you are learning, we have prepared a guide that contains the list of the most important concepts you will need to learn to become a Machine Learning engineer.