In this article, I am going to show you step-by-step how to encode text using a word2vec model of your choice. Full code available at my repo.
Of the many approaches available in Machine Learning, converting data of any kind into vectors has become one of the most popular. This approach is known as vector-based technology, and it became popular for the following reasons:
- Much less difficult to implement
- Cheaper than many other models
- Less or no tuning required
- Flexibility: the same model can be used in multiple use cases
Mastering vector-based technology and the libraries used to convert data into vectors is a solid step toward becoming proficient in Machine Learning (hopefully, thanks to our guides), and it will make you familiar with many essential Machine Learning algorithms, libraries, and projects.
What are vectors?
Vectors are a very simple concept: a vector is a list of one or more numbers. The way a vector is written is very similar to a list in Python. For example, the following is a vector that consists of 5 numbers:
[3, 5, 2, 8, 4]
The advantage of using a vector is that it can be treated as a point in a Cartesian space, in this case one with 5 dimensions (x, y, z, j, k). Vectors can have any length. For example, the following vector:
[5, 2, 8]
can be plotted in a 3D space, because it only has 3 dimensions.
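As a minimal sketch in plain Python (the variable names and values are my own, chosen only for illustration), a vector can be stored as a list, its number of dimensions is simply its length, and the distance between two vectors of the same length can be computed with the standard library:

```python
import math

# Two vectors with three dimensions each (the values are arbitrary examples)
a = [5, 2, 8]
b = [1, 2, 5]

# The number of dimensions is the number of elements in the list
dimensions = len(a)

# Euclidean distance between the two points in 3D space
distance = math.dist(a, b)

print(dimensions, distance)
```

Being able to measure distances like this is what later allows us to say that two encoded words are "close" or "far apart".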
What are vectors used for?
Now that you know what vectors are, it is easy to understand what vector-based technology is. Basically, we convert all of our data that needs conversion, including categorical data, into a vector format. Once we are working only with vectors, we can apply several different techniques to make the best use of them. Some example applications are:
- Recommendation systems
- Topic modeling
- Document classification
- Customer segmentation
- Trend and keyword extraction
Using Embeddings to create vectors
You are probably asking yourself: how do we convert all our data to vectors? Numerical data is already in a vectorized form, so we don’t have to touch it. But what about categorical data? The conversion from categorical data into vectors is called encoding, and the end result of this process is called an embedding. Mastering embeddings is the main objective of this series.
Software Engineers employ encoding models, which usually are neural networks, to convert categorical data into numerical data. To see why encoding is so important, we can look immediately at the result of this conversion:
The objective of encoding categorical data, which can be words but also sentences, images, product choices, or movies, is to group similar concepts together in space. For example, one of the most popular models to convert words into vectors has been word2vec. Once the encoding process has been completed, words that lie in the same region of space tend to have similar meanings. The same applies to every trained encoder: encoded images will be distributed in space according to their similarity, and so will sentences or documents. Encoders are used to extract the meaning behind data.
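One common way to measure how close two embeddings are is cosine similarity. The sketch below uses tiny hand-made 3-dimensional vectors (hypothetical values, not the output of any real model) just to show the idea: related words should score higher than unrelated ones.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: close to 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings, made up for illustration only
toy_embeddings = {
    "thor": [0.9, 0.8, 0.1],
    "superhero": [0.8, 0.9, 0.2],
    "chocolate": [0.1, 0.2, 0.9],
}

sim_related = cosine_similarity(toy_embeddings["thor"], toy_embeddings["superhero"])
sim_unrelated = cosine_similarity(toy_embeddings["thor"], toy_embeddings["chocolate"])

# The related pair points in nearly the same direction, the unrelated pair does not
print(sim_related > sim_unrelated)
```

With a real encoder the vectors have many more dimensions, but the comparison works the same way.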
The gensim library
Gensim is one of the most widely used Python libraries for downloading and managing text encoders. All the text encoders you are going to use are pre-trained, meaning that they are models that have already been trained on gigabytes of data by renowned experts. It is very common to use pre-trained models in NLP. You just have to pick one from a list of models and see how it performs on your data.
Nowadays, static embeddings like these have largely been superseded by transformer technology, which uses dynamic (context-dependent) embeddings. Gensim, however, is still in use and is a good starting point to learn how encoding works without overcomplicating the task at hand.
Structuring the algorithm
To try out this library on a text we wish to encode, we are going to follow these simple steps:
- Install gensim
- Pick model
- Download a pre-trained model
- Create a word dataset
- Encode words
- Dimensionality reduction with PCA
- Visualize data
1. Install gensim
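The main library we are going to use for this project is gensim (see the gensim documentation for details); the later steps also rely on scikit-learn for PCA and plotly for the chart. In a notebook, gensim can be installed with: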
!pip install gensim
2. Pick model
As the list below shows, there are plenty of models we can choose from. Some of them are very heavy and suit in-depth analyses with no time constraints (for example, when writing a research paper), while others are meant for real-time applications and need to prioritize speed over quality.
import gensim.downloader as api

# show all available models in gensim-data
list(api.info()['models'].keys())

['fasttext-wiki-news-subwords-300',
 'conceptnet-numberbatch-17-06-300',
 'word2vec-ruscorpora-300',
 'word2vec-google-news-300',
 'glove-wiki-gigaword-50',
 'glove-wiki-gigaword-100',
 'glove-wiki-gigaword-200',
 'glove-wiki-gigaword-300',
 'glove-twitter-25',
 'glove-twitter-50',
 'glove-twitter-100',
 'glove-twitter-200',
 '__testing_word2vec-matrix-synopsis']
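Since the models differ so much in size, it can help to filter the catalogue before downloading anything. The sketch below is only an illustration: it assumes that each entry returned by api.info()['models'] carries a "file_size" field in bytes and a "parameters" dictionary with a "dimension" key, which is my reading of the gensim-data metadata and worth checking against your installed version. The helper name models_within_size and the sample catalogue are hypothetical.

```python
def models_within_size(info, max_megabytes):
    """Return (name, dimension) pairs for models whose download fits the size limit."""
    limit_bytes = max_megabytes * 1024 * 1024
    selected = []
    for name, meta in info["models"].items():
        # Entries without a size are treated as too large to be safe
        if meta.get("file_size", float("inf")) <= limit_bytes:
            dimension = meta.get("parameters", {}).get("dimension")
            selected.append((name, dimension))
    return sorted(selected)

# Hand-written stand-in for api.info(); the names and sizes are made up
sample_info = {
    "models": {
        "toy-small-25": {"file_size": 100 * 1024 * 1024, "parameters": {"dimension": 25}},
        "toy-large-300": {"file_size": 2000 * 1024 * 1024, "parameters": {"dimension": 300}},
    }
}

small_models = models_within_size(sample_info, max_megabytes=200)
print(small_models)
```

With gensim installed and a network connection, the same function could be called as models_within_size(api.info(), 200), assuming the metadata has the shape described above.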
3. Download a pre-trained model
I chose one of the lightest models to keep things simple. The trained model is approximately 100MB and encodes words in only 25 dimensions (25 is considered few). Note that it is a GloVe model rather than a word2vec one, although gensim lets us load and query both in the same way. If you wish to go big, you can choose ‘glove-wiki-gigaword-300’, which is a considerably larger download and encodes words in 300 dimensions.
# download model
word2vec = api.load('glove-twitter-25')
4. Create a word dataset
A word encoder can only work on individual words. How, then, can we work on more complex text, like sentences, tweets, or entire documents? One answer, which we will see in future articles, is to encode each word in a text and take the average of all the word vectors, so that, for example, a sentence of 10 words becomes not 10 vectors but a single one. In other words, we convert a sentence into a single vector: a sentence2vec.
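The averaging idea can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the function name sentence_vector is my own, the embeddings are a hand-made dictionary with invented 2-dimensional values, and words are split on whitespace after lowercasing.

```python
def sentence_vector(sentence, embeddings):
    """Average the vectors of the known words in a sentence into one vector."""
    known = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    if not known:
        return None  # no word of the sentence is in the vocabulary
    size = len(known[0])
    # Average each dimension across all the word vectors
    return [sum(vec[i] for vec in known) / len(known) for i in range(size)]

# Hypothetical 2-dimensional embeddings, made up for illustration only
toy_embeddings = {
    "thor": [1.0, 3.0],
    "likes": [2.0, 2.0],
    "chocolate": [3.0, 1.0],
}

averaged = sentence_vector("Thor likes chocolate", toy_embeddings)
print(averaged)
```

Whatever the length of the sentence, the result has the same number of dimensions as a single word vector, which is what makes sentences comparable with each other.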
words = ['spider-man', 'superhero', 'thor', 'chocolate', 'candies']
5. Encode words
Now that I have a list of words, I can encode each of them, provided the word is available in the model's vocabulary (unknown words, as you can imagine, cannot be encoded). The model works like a dictionary: we input a word and get a vector as a result.
# encoding
vectors = [word2vec[x] for x in words]
vectors

[array([-0.38728, -0.51797, 0.97012, 0.13815, -0.20966, -0.29899, 0.97128, -0.18348, 0.26849, -1.6371 , 0.97 , 0.70658, -1.711 , -0.59956, 0.44781, 0.54718, 0.20839, 0.39681, 0.41947, 0.58817, -1.0464 , 0.5229 , -0.52117, -1.0674 , 0.21981], dtype=float32),
 array([-0.19496, -0.11281, 0.61174, -0.27074, 0.50853, -0.15016, 1.4945 ...
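Because looking up a word that is not in the vocabulary raises a KeyError, it can be safer to separate known from unknown words before encoding. The sketch below uses a plain dictionary as a stand-in for the loaded model (the vectors are invented and the helper name split_by_vocabulary is hypothetical); as far as I know, the object returned by api.load supports the same `in` test and square-bracket lookup, but treat that as an assumption to verify.

```python
def split_by_vocabulary(words, model):
    """Separate words the model can encode from words it does not know."""
    known = [w for w in words if w in model]
    unknown = [w for w in words if w not in model]
    return known, unknown

# Plain dict standing in for the loaded model; the vectors are made up
toy_model = {"thor": [0.1, 0.2], "chocolate": [0.3, 0.4]}

known, unknown = split_by_vocabulary(["thor", "chocolate", "flibbertigibbet"], toy_model)

# Only the known words are encoded, so no KeyError can occur here
vectors = [toy_model[w] for w in known]
print(known, unknown)
```

Keeping the list of unknown words around is useful for spotting typos or terms the model was never trained on.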
6. Dimensionality reduction with PCA
As shown before, all the data initially has 25 dimensions (each vector is made of 25 numbers). Unfortunately, 25 dimensions cannot be visualized, as we cannot plot more than three spatial dimensions. But, thanks to a dimensionality reduction technique called PCA (Principal Component Analysis), we will be able to visualize our data once it is compressed to only 2 dimensions.
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np

# stack the encoded word vectors into an (n_words, 25) array and reduce it to 2 dimensions
pca = PCA(n_components=2, svd_solver='auto')
pca_result = pca.fit_transform(np.array(vectors))
pca_result

array([[-1.68674996, -0.32655628],
       [-1.65333308, 0.08659033],
       [-1.20627299, -1.09894479],
       [ 3.63810973, -1.08039993],
       [ 0.9082463 , 2.41931067]])
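To make the PCA step less of a black box, here is a rough sketch of what the projection does, written with NumPy only rather than scikit-learn. It is not the library's implementation: the function name pca_2d and the sample points are invented for illustration, and the signs of the resulting coordinates may differ from what scikit-learn returns.

```python
import numpy as np

def pca_2d(data):
    """Project the rows of `data` onto their first two principal components."""
    X = np.asarray(data, dtype=float)
    centered = X - X.mean(axis=0)  # PCA works on mean-centered data
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Five made-up 4-dimensional points, for illustration only
points = [
    [1.0, 2.0, 0.5, 0.0],
    [2.0, 4.1, 1.0, 0.1],
    [3.0, 6.2, 1.4, 0.0],
    [4.0, 7.9, 2.1, 0.2],
    [5.0, 10.0, 2.4, 0.1],
]

projected = pca_2d(points)
print(projected.shape)
```

The first column of the result captures the direction along which the points vary the most, and the second column the next most informative direction, which is why two columns are often enough for a readable plot.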
7. Visualize data
After compressing the data into 2 dimensions, we can visualize it using a scatterplot. I find plotly to be the best library to make graphics, mostly because it is fast and the graphs are all interactive.
import plotly.express as px

# x and y given as array_like objects
x = list(pca_result[:,0])
y = list(pca_result[:,1])

# scatter plot with each point labeled by its word
fig = px.scatter(x=x, y=y, text=words)
fig.update_traces(textfont_size=22)
fig.show()
Did you find this guide useful? If you wish to explore more code, you can check the list of projects that will progressively help you learn data science. If you wish to have a general idea of what you are learning, we have prepared a guide that contains the list of the most important concepts you will need to learn to become a Machine Learning Engineer.