Machine Learning for Beginners; Project 3: Pokemon k-nearest neighbors classifier

In Project 1 of the ML for Beginners series, I showed you how to build the simplest possible classification model. Because the dataset was easy to work with, I could have chosen almost any of the available models. Support Vector Machines (SVM), one of the simplest models to implement, is exactly the one I used to solve that problem. However, there are many classification algorithms, and which one to use depends on the case at hand.

You can find the entire code used in this project in the pythonkai repo.

How do we know which algorithm to use?

The only way is to understand how the math behind each algorithm works, so, as you can imagine, it is a mixture of experience and knowledge. Trust me, experimenting on small datasets with many different classification algorithms does not build much experience, because it is only with large volumes of data that the math starts making sense. If the data volume is too small, even a small change in the train and test proportions can shift accuracy by 5%.
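You can see this instability for yourself with sklearn's small built-in iris dataset (a quick illustration, separate from this project): the same model, scored on three different random splits, gives noticeably different accuracies.

# accuracy of the same model under three different random splits
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)
for seed in range(3):
    X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_iris, test_size=0.3, random_state=seed)
    acc = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr).score(X_te, y_te)
    print(f'split {seed}: accuracy {acc:.3f}')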

The best way to learn the different classification algorithms is to build a separate project for each one and, most importantly, to understand what makes each algorithm peculiar compared to the others. After a few months of training and several projects, you will be able to discern the differences between almost all the most important classification models.

What is the k-nearest neighbors classification algorithm?

K-nn is a peculiar algorithm that approaches a classification problem differently from all the other models. Instead of working through functions (like logistic regression), it works directly through datapoints. Lucky for us, the algorithm is very intuitive to understand.

As usual, our data is in tabular form, structured as features that act as predictors for a label, which is the data we wish to predict. If we plot all the features in a Cartesian field (assuming they are numerical already; if they are not, we can use a method called encoding to turn categorical data into numerical data), each sample in our dataset will correspond to one point in space, and each point will belong to a class: essentially, our label.
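Our Pokemon stats are already numerical, but for reference, here is a toy sketch of what encoding looks like (the Type column and its values are made up for illustration):

# hypothetical example: turning a categorical column into numbers
import pandas as pd

toy = pd.DataFrame({'Type': ['Fire', 'Water', 'Fire', 'Grass']})
print(pd.get_dummies(toy['Type']))               # one-hot: one 0/1 column per category
print(toy['Type'].astype('category').cat.codes)  # alternative: one integer code per category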

PCA of the Pokemon dataset
PCA of the Pokemon dataset with the new Pokemon; we can see the 5 closest neighbors extracted with knn

Given a new unlabeled point, like a new Pokemon (orange dot), we place it in space using its features and look at the k closest points (let's say 5). As we can see, the surrounding neighbors are all Uncommon Pokemon (green dots): so we classify the new Pokemon as Uncommon. I will now show you how to code the entire algorithm to reach the same solution.
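Before we touch the real dataset, here is that voting logic in miniature, with made-up 2D points (illustrative only, not part of the project code): sort the labeled points by distance to the new point, then take a majority vote among the k closest.

# a minimal sketch of the k-nn idea in plain Python
import math
from collections import Counter

points = [((50, 60), 'Common'), ((52, 58), 'Common'),
          ((90, 95), 'Rare'), ((88, 92), 'Rare'), ((91, 90), 'Rare')]
new_point = (89, 93)

# sort the labeled points by Euclidean distance to the new point
by_distance = sorted(points, key=lambda p: math.dist(p[0], new_point))

# majority vote among the k closest neighbors
k = 3
votes = Counter(label for _, label in by_distance[:k])
print(votes.most_common(1)[0][0])   # -> 'Rare'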

Structuring the algorithm

As usual, we will first need to conceptualize our algorithm into a series of comprehensible steps. Full code available in the pythonkai repo.

  1. Importing the dataset
  2. Preprocessing
  3. Features and Labels split
  4. Create new Pokemon
  5. k-nearest neighbors classifier (knn classifier)
  6. Make a prediction
  7. Run analysis
  8. Graph results

This is a step further from the algorithms I have shown you so far; however, it will let you familiarize yourself with a new approach to Machine Learning problem-solving. Remember that any information can be placed in a Cartesian field by being converted into a vector. Vector-based technology is one of the most advanced fields in Machine Learning, and this is your first taste of it.

1. Import dataset

The first step is importing the dataset and exploring it: this is known as Exploratory Data Analysis (EDA). You will find the dataset in the pythonkai repo; it is just a few kB. As we can see, our dataset contains two columns, Rank and Overall, that we do not need as predictors: Rank is only a count of the Pokemon, while Overall is just the sum of all the stats (we will, however, use Overall to create our labels before excluding it from the features).

import pandas as pd

# load the dataset; this file uses ';' as the separator
df = pd.read_csv('pokemon_gold_stats.csv', sep=';')
df
Snapshot of the original dataset
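Beyond printing the dataframe, a few standard pandas calls give a quick overview during EDA (optional checks, not in the original snippet):

# quick exploratory checks
df.shape       # (rows, columns)
df.info()      # column names, dtypes, missing values
df.describe()  # summary statistics for the numerical columns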

2. Preprocessing

During the preprocessing phase, I am going to strip all the unnecessary information from the dataset so that I can work on it. Yet after getting rid of Rank (Overall will be excluded from the features later), there is still one problem: my Pokemon have no labels. What am I going to predict?

The point of a classification algorithm is to predict labels after learning from the labels that already exist in the data. Therefore, I will create my own labels based on the Overall score.

df.columns = ['Rank', 'Pokemon', 'Overall', 'HP', 'Atk', 'Def', 'SA', 'SD', 'Spd']
df

# assign a rarity tier based on the Overall score
# (plain comparisons avoid the gaps that range() boundaries would leave at 1350, 1700 and 2000)
def ranking(x):
    if x <= 1350:
        return 'Common'
    if x <= 1700:
        return 'Uncommon'
    if x <= 2000:
        return 'Rare'
    return 'Legendary'

df['Rating'] = df['Overall'].apply(ranking)
df = df.drop('Rank', axis=1)
df


Snapshot of the dataset with the addition of a Rating
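Since we created the labels ourselves, it is worth a quick check on how the four classes are distributed (an optional check, not in the original post):

# how many Pokemon fall in each rarity tier
df['Rating'].value_counts()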

3. Features and Labels split

The k-nearest neighbors (knn) classifier will need the datapoints in the dataset as well as the label of each one of them. Following our standard practice of creating features and labels, which you have seen in the previous projects, the features will be stored as X and the labels as y.

X = df[['HP', 'Atk', 'Def', 'SA', 'SD', 'Spd']]
# selecting the column with df['Rating'] gives a 1D Series,
# which is the shape sklearn expects for the labels
y = df['Rating']
Features and Labels in comparison

_. No train and test dataset?

If you followed my previous projects, you know it is common practice to split the dataset into a train and a test set before training the model. In this case, that split would not make sense, because there is no model to tune on existing data: k-nn does not first learn a function for how the classification works; for each new datapoint, we just look at the surrounding existing Pokemon to estimate its best label.
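That said, if you ever do want a number for how well the classifier generalizes on this data, the conventional hold-out evaluation would look like the sketch below (an optional aside, not part of this project):

# optional: measure accuracy on a held-out portion of the data
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
knn.score(X_test, y_test)   # mean accuracy on the held-out 20%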

4. Create new Pokemon

To create a new Pokemon, I simply need to generate new numbers for its 6 stats. I will generate random numbers, in this case, and store them in a variable called new (the new Pokemon).

#create a new pokemon: generate 6 random stats, one per feature column
import random

new = [random.randrange(150, 420) for _ in range(6)]
new
[169, 335, 279, 299, 286, 253]

5. k-nearest neighbors classifier (knn classifier)

It is now time to run the knn classifier. I will implement it with the sklearn library, which is both the most used and one of the most efficient machine learning libraries among data scientists.

The KNeighborsClassifier will memorize the existing dataset through fit(X, y); then, for any input coordinates, it will identify the 5 closest points in space: our neighbors.

#knn classifier on the nearest pokemons
from sklearn.neighbors import KNeighborsClassifier

# k=5: each prediction is a majority vote among the 5 nearest datapoints
neigh = KNeighborsClassifier(n_neighbors=5)
neigh.fit(X, y)
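By default, KNeighborsClassifier measures distance with the Euclidean metric (Minkowski with p=2) and gives every neighbor an equal vote. Both are tunable; for example, a variant that weights closer neighbors more heavily would look like this (an optional tweak, not used in this project):

# hypothetical variant: let closer neighbors count more in the vote
neigh_w = KNeighborsClassifier(n_neighbors=5, weights='distance', metric='euclidean')
neigh_w.fit(X, y)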

6. Make a prediction

Given my new Pokemon, I will input its coordinates, and the classifier will estimate the most appropriate label according to its surroundings: in this case, Uncommon.


#make a prediction
neigh.predict([new])
array(['Uncommon'], dtype=object)
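One small note: because X is a DataFrame, recent versions of sklearn warn that the bare list carries no feature names. Wrapping the new point in a DataFrame with the same columns gives the same prediction without the warning (a cosmetic fix, not in the original post):

# same prediction, without the feature-name warning
new_df = pd.DataFrame([new], columns=X.columns)
neigh.predict(new_df)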

***From this point on, you no longer need to follow the tutorial. I am including further analysis for the most curious among you, so that you can appreciate the depth of the algorithm; what I have written up to point 6 is sufficient for a beginner.

7. Run analysis

With this code, I wish to understand the probability of the new Pokemon belonging to each class. In this case, all the neighbors were Uncommon, so it has a 100% chance of being Uncommon, but with different neighbors the probability distribution would likely change.

#run analysis: convert the class probabilities into a dictionary
list_ratings = sorted(y.unique())
list_prob = neigh.predict_proba([new]).tolist()[0]
dict(zip(list_ratings, list_prob))
{'Common': 0.0, 'Legendary': 0.0, 'Rare': 0.0, 'Uncommon': 1.0}
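Sorting the unique labels happens to match the column order of predict_proba, but the fitted classifier exposes that order directly through its classes_ attribute, which is the safer way to build the same dictionary:

# classes_ holds the exact label order used by predict_proba
dict(zip(neigh.classes_, neigh.predict_proba([new])[0]))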

However, I still wish to know who the neighbors are. With this for loop I will print the information on each one of them, so I can analyze the surrounding data of my new Pokemon a bit better.

# kneighbors returns (distances, indices); [1] selects the neighbor indices
for el in neigh.kneighbors([new])[1].tolist()[0]:
    print(df.iloc[el])

8. Graph results

To graph the results, I will use a technique called dimensionality reduction, which is quite advanced for a beginner. Basically, I have 6 stat columns, but the human mind can only visualize a maximum of 3 dimensions, 4 if we introduce a slider (representing the concept of time). I will compress these 6 dimensions into 2, so that I can visualize the data in a 2D graph.

Remember that if you use this technique, some information will necessarily be lost in the compression, so the configuration of points in 2D will not be equivalent to the configuration of the same points in the original 6D: for example, if points 1 and 54 are closest neighbors in 6D, they won't necessarily be closest neighbors in 2D.

from sklearn.decomposition import PCA
import plotly.express as px

# compress the 6 stat columns into 2 principal components
pca = PCA(n_components=2, svd_solver='auto')
pca_result = pca.fit_transform(df[['HP', 'Atk', 'Def', 'SA', 'SD', 'Spd']])

# scatter plot of the 2 components, colored by rarity
fig = px.scatter(x=pca_result[:, 0], y=pca_result[:, 1], color=df['Rating'])
fig.show()

This is the resulting image from the PCA.

PCA of the Pokemon stats
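To quantify how much of the original 6-dimensional structure survives the compression, you can inspect the fitted PCA (a quick check, not in the original post):

# fraction of the total variance kept by each of the 2 components
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())  # total fraction of the 6D variance retained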

However, our graph is missing one important element: where is the new Pokemon? I will copy the original dataset (always a good practice: if you overwrite the initial data and something goes wrong, you have to restart the entire process) and add the new Pokemon at the end.

#add the new Pokemon to the dataset, so we can visualize it
df_ = df.copy()
# one new row: name, Overall (the sum of the stats), the 6 stats, and a placeholder label
df_.loc[len(df_)] = ['New'] + [sum(new)] + new + ['Unknown']
df_

As you can see, the new point has been added at the bottom. I purposely gave it an Unknown label, so that we can distinguish it from the rest of the datapoints.

# re-run the PCA with the new Pokemon included
pca = PCA(n_components=2, svd_solver='auto')
pca_result = pca.fit_transform(df_[['HP', 'Atk', 'Def', 'SA', 'SD', 'Spd']])

# the Unknown label gives the new point its own color in the plot
fig = px.scatter(x=pca_result[:, 0], y=pca_result[:, 1], color=df_['Rating'])
fig.show()

This is the final result! We can see the orange dot in the middle of the screen, surrounded by all Uncommon Pokemon.

PCA of the Pokemon stats with the new Pokemon as an orange dot

The end?

What is next? -> How to build a Decision Tree Classifier

Did you find this guide useful? If you wish to explore more code, you can check the list of projects that will progressively help you learn data science. If you wish to have a general idea of what you are learning, we have prepared a guide that contains the list of the most important concepts you will need to learn to become a Machine Learning engineer.

Join our free programming community on Discord, learn how to code, and meet other experts.
