In my first guide for Machine Learning for Beginners, I explained as simply as possible how classification models work. As mentioned in that guide, if you are reading this article you are still a beginner, so for now you should not work on projects that are too hard, for example Neural Networks or NLP problems.
In the first phase of your learning, you will need to focus on Supervised Learning only. Then you will be able to work on more complex models that require many variables, and only after that can you switch to Unsupervised Learning.
You can access and run the entire code (dataset included) from our repo.
The theory behind linear regression
This is probably the simplest Machine Learning algorithm you will ever find. It is so simple that you may even ask yourself why it is considered AI. Essentially, it consists of drawing a line that approximates the pattern behind your data points.
However, this is not the most useful way to understand it. You need to understand what links this regression algorithm with classification algorithms; only then can you know what Supervised means.
Explained very simply, a supervised AI program is structured like this: we start from tabular data and divide the columns into two kinds, the columns we wish to predict (labels) and the ones that will act as predictors (features). Once the columns are split, we can train the model. This is valid for both regression (when labels are numerical) and classification (when labels are categorical).
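To make "drawing a line" concrete, here is a minimal sketch of how the line y = slope * x + intercept can be found with the classic least-squares formulas, in plain Python. The ten points are the same values as the sample dataset used later in this guide; the variable names are my own choice, and this is only an illustration of the idea, not what any library does internally.

```python
# Ten (x, y) points, same values as the sample dataset used later in this guide.
xs = [0, 0.5, 0.8, 1.2, 1.4, 2.1, 2.4, 2.6, 3.4, 4]
ys = [1, 1.6, 2.3, 2, 2.1, 2.4, 2.6, 2.1, 2.3, 3]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares slope: how x and y vary together, divided by how much x varies.
numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
denominator = sum((x - mean_x) ** 2 for x in xs)
slope = numerator / denominator

# The least-squares line always passes through the point (mean_x, mean_y).
intercept = mean_y - slope * mean_x

print(f"y = {slope:.3f} * x + {intercept:.3f}")
```

The line found this way is the one that minimizes the sum of squared vertical distances between the points and the line.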
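As a small illustration of the split between features and labels, here is a sketch on a made-up table. The column names and values are hypothetical and only serve to show the difference between a numerical label (regression) and a categorical one (classification).

```python
import pandas as pd

# Hypothetical table: every name and value here is made up for illustration.
houses = pd.DataFrame({
    "size_m2": [50, 80, 120, 65],
    "rooms": [2, 3, 5, 3],
    "price": [150.0, 230.0, 340.0, 190.0],      # numerical column
    "kind": ["flat", "flat", "house", "flat"],  # categorical column
})

# Features: the columns that act as predictors.
features = houses[["size_m2", "rooms"]]

# Labels: the column we wish to predict.
price_label = houses["price"]  # numerical label -> regression problem
kind_label = houses["kind"]    # categorical label -> classification problem
```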
Structuring the algorithm
The first step in building any algorithm, once the theory is clear, is to outline the necessary steps. For our regression model, these are:
- Importing the dataset
- Feature and label selection
- Train and test split
- Train the model
- Evaluate the model
Note that in this example I am only going to use linear regression on two columns: one feature and one label. With more columns, it would be much more difficult to visualize (the maximum is 3D!). When you have too many dimensions to visualize, you can compress them with a method called PCA; for now, however, that is still too complex.
With simple models you can only use one label, but you can use as many features as you want as predictors.
1. Importing the dataset
For this problem, rather than working on real data, I will use a sample of 10 arbitrary data points, mostly because I want to keep things as simple as possible. The principle is the same with real data. I could generate them randomly with a function, but then every time you run the program you would get a different dataset, which would make the instructions in this article harder to follow.
import pandas as pd
import numpy as np

# Ten (x, y) data points, one row per point
df = pd.DataFrame([
    [0, 1], [0.5, 1.6], [0.8, 2.3], [1.2, 2], [1.4, 2.1],
    [2.1, 2.4], [2.4, 2.6], [2.6, 2.1], [3.4, 2.3], [4, 3],
])
df
Note that when working on a regression model with real data, it is common to either normalize or standardize the features. As a rule of thumb, when the data follows a normal distribution you can standardize it, and when it does not, normalization is usually the better choice. Sklearn has specific classes, such as StandardScaler and MinMaxScaler, that make these preprocessing methods easy to apply.
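For reference, this is a minimal sketch of both preprocessing methods using sklearn's MinMaxScaler (normalization) and StandardScaler (standardization). I apply them to the feature values of the sample dataset only to show what they do; the variable names are my own.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Feature values of the sample dataset, as a single 2D column.
x = np.array([[0], [0.5], [0.8], [1.2], [1.4], [2.1], [2.4], [2.6], [3.4], [4]])

# Normalization: rescale the values so they fall between 0 and 1.
x_normalized = MinMaxScaler().fit_transform(x)

# Standardization: shift and scale the values to mean 0 and standard deviation 1.
x_standardized = StandardScaler().fit_transform(x)

print(x_normalized.ravel())
print(x_standardized.ravel())
```

On a real project you would fit the scaler on the training data only and reuse it to transform new data.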
3. Feature and label selection
To input more features into your model, just put them into X (if you have any trouble, you can comment and we will help you with the code).
X = df[[0]]  # feature: first column, kept as a 2D table as sklearn expects
y = df[1]    # label: second column
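If you had more than one feature, the selection could look like the sketch below. The three-column table is hypothetical (the values are made up), and I assume the last column is the label; the point is only that X can hold several columns while y stays a single one.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical table: columns 0 and 1 are features, column 2 is the label.
df_multi = pd.DataFrame([
    [0.0, 1.0, 1.0],
    [0.5, 0.8, 1.6],
    [0.8, 1.5, 2.3],
    [1.2, 0.3, 2.0],
    [1.4, 1.1, 2.1],
])

X_multi = df_multi[[0, 1]]  # two feature columns
y_multi = df_multi[2]       # one label column

# The model learns one coefficient per feature.
reg_multi = LinearRegression().fit(X_multi, y_multi)
print(reg_multi.coef_)
```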
4. Train and test split
Do we need train and test sets for this exercise? Not really, because here the regression model is mostly used to predict future data. It is also harder to assess how well a regression model is performing, compared to a classification model.
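If you do want to hold out a test set, for example on a larger dataset, a minimal sketch with sklearn's train_test_split could look like this. The test_size and random_state values are my own choices, and I assume column 0 is the feature and column 1 is the label.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame([
    [0, 1], [0.5, 1.6], [0.8, 2.3], [1.2, 2], [1.4, 2.1],
    [2.1, 2.4], [2.4, 2.6], [2.6, 2.1], [3.4, 2.3], [4, 3],
])
X = df[[0]]  # assumed feature column
y = df[1]    # assumed label column

# Keep 20% of the rows aside for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(len(X_train), len(X_test))
```

The model would then be trained on X_train and y_train, and checked on X_test and y_test.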
5. Train the model
For this small example we do not need to preprocess the data: sklearn's linear regression can be fit directly on X and y. I will use its LinearRegression estimator.
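from sklearn.linear_model import LinearRegression

reg = LinearRegression().fit(X, y)  # fit the line to the data
reg
reg.score(X, y)  # R² of the fit on the same data used for training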
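The number returned by score is the R² (coefficient of determination): 1 means the line explains the data perfectly, while 0 means it does no better than always predicting the average of y. As a sketch, here is the same value computed by hand, assuming column 0 is the feature and column 1 is the label.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame([
    [0, 1], [0.5, 1.6], [0.8, 2.3], [1.2, 2], [1.4, 2.1],
    [2.1, 2.4], [2.4, 2.6], [2.6, 2.1], [3.4, 2.3], [4, 3],
])
X = df[[0]]  # assumed feature column
y = df[1]    # assumed label column

reg = LinearRegression().fit(X, y)
r2 = reg.score(X, y)

# R² by hand: 1 - (squared errors of the line) / (squared errors of the mean).
predictions = reg.predict(X)
ss_res = ((y - predictions) ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()
r2_by_hand = 1 - ss_res / ss_tot

print(r2, r2_by_hand)
```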
6. Make a prediction
Compared to a classification model, we are not so much interested in knowing how well the model performs on this simple dataset as in making a future prediction.
After the model has been trained (green line), we can use it to make predictions on future data points, or simply on data points that do not exist yet. A point with an x value of 5, for example, will have an estimated y value of 3.23.
reg.predict([[5]])  # one new point with x = 5
# output: 3.23342267
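To see where this number comes from, here is a small sketch that reads the slope and intercept from the trained model and recomputes the prediction by hand. It assumes column 0 is the feature and column 1 is the label; the second new point (x = 6) is my own addition, to show that predict accepts several points at once.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame([
    [0, 1], [0.5, 1.6], [0.8, 2.3], [1.2, 2], [1.4, 2.1],
    [2.1, 2.4], [2.4, 2.6], [2.6, 2.1], [3.4, 2.3], [4, 3],
])
X = df[[0]]  # assumed feature column
y = df[1]    # assumed label column

reg = LinearRegression().fit(X, y)

# predict expects a 2D input: one inner list per new data point.
predicted = reg.predict([[5], [6]])

# The same prediction by hand, using the line y = slope * x + intercept.
slope = reg.coef_[0]
intercept = reg.intercept_
by_hand = slope * 5 + intercept

print(predicted[0], by_hand)
```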
What is next? -> Pokemon classifier with knn algorithm
Did you find this guide useful? If you wish to explore more code, you can check the list of projects that will progressively help you learn data science. If you wish to have a general idea of what you are learning, we have prepared a guide that contains the list of the most important concepts you will need to learn to become a Machine Learning engineer.