How to show a progress bar when using the apply function on a pandas DataFrame
What is Schrödinger’s pandas?
To understand this concept you need to be familiar with both:
- Schrödinger’s cat
- The pandas library
Schrödinger’s cat is a thought experiment in physics that imagines a cat sealed in a box with a vial of poison that may break at any time. Because you cannot know whether the vial has broken until you open the box, the cat can be considered both alive and dead at the same time. This is an oversimplification meant to make sense of quantum physics, but the principle behind it is interesting and also applies to Python programming.
The pandas library is one of the most widely used Python libraries in data science. It lets us work with tabular data (like a spreadsheet). Because it keeps the data in memory, it cannot really handle big data (in my experience, beyond roughly 10GB it becomes unmanageable), but you can still do a lot with it, since it lets you transform a dataset with simple commands.
Schrödinger’s pandas (or better yet, Michelangiolo’s pandas) is a conjecture that I wish to apply to big data processes, and it is based on exactly the same logic. The other day I had to encode 200MB of text data with a transformer model that I had never tested before (all-MiniLM-L6-v2).
I had no idea how long the process would take, so I started it hoping to see results after just a few minutes. Minutes passed without any results. Five minutes turned into ten, and ten minutes into half an hour. After seventy minutes of agony (while my PC is using all of its GPU it cannot run any other computationally intensive process, so it is basically unusable), I decided to end the experiment without even knowing how much data had been processed.
I then understood that when I start a process without any way to track its progress, at any moment it could have barely started or be close to finishing. Because there is no way of knowing, it can be both things at the same time.
The solution: add a tracking bar
The solution to this hideous problem is to look into the box! To be specific, the process I am referring to is a pandas method called apply. Called on a Series (a single column), apply runs a function on each element; called on a DataFrame, it runs a function on each column or, with axis=1, on each row. This is a standard way of encoding an entire dataset, for example.
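To make the behavior of apply concrete, here is a minimal sketch on a toy DataFrame. The column name text and the function fake_encode are hypothetical stand-ins chosen for illustration (fake_encode takes the place of a real model's encode method); they are not part of my original experiment.

```python
import pandas as pd

# Toy dataset; the "text" column name is an assumption made for this example
df = pd.DataFrame({"text": ["hello world", "pandas apply", "progress bar"]})


def fake_encode(text):
    # Hypothetical stand-in for model.encode: returns a tiny "vector"
    # made of the character count and the word count
    return [len(text), text.count(" ") + 1]


# Series.apply calls the function once per element of the column
encoded = df["text"].apply(fake_encode)

# DataFrame.apply with axis=1 calls the function once per row
lengths = df.apply(lambda row: len(row["text"]), axis=1)

print(encoded.tolist())
print(lengths.tolist())
```

Neither call gives any feedback while it runs, which is exactly the problem on a large dataset.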
!pip install tqdm
tqdm is a small library that lets you monitor the progress of apply with a progress bar that appears in the console:
from tqdm import tqdm

# Register the progress_apply method on pandas objects
tqdm.pandas()

# Same as apply, but with a progress bar
df = df.progress_apply(lambda x: model.encode(x))
This code is enough to add the progress bar. When it runs, it produces output like this:
94%|█████████ | 66886/70929 [1:28:58<13:05, 12.45it/s]
The progress bar shows how many rows have been processed out of the total, how much time has elapsed, an estimate of the time still needed to complete, and the current processing rate.
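If you want to try this without loading a model, the following is a small self-contained sketch. The function slow_encode and the column names are hypothetical: slow_encode only simulates an expensive step with a short sleep, so the numbers it produces are not real embeddings.

```python
import time

import pandas as pd
from tqdm import tqdm

# Register progress_apply on pandas objects; desc sets the bar's label
tqdm.pandas(desc="encoding")

# Toy dataset of 200 short sentences (made up for this example)
df = pd.DataFrame({"text": [f"sentence number {i}" for i in range(200)]})


def slow_encode(text):
    # Hypothetical stand-in for an expensive model.encode call
    time.sleep(0.001)
    return len(text)


# Works like df["text"].apply(slow_encode), but draws a progress bar
df["encoded"] = df["text"].progress_apply(slow_encode)
```

Running it on a small sample like this first is also a cheap way to read the rate from the bar and estimate how long the full dataset would take.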