What is a normal distribution?
A normal distribution is probably the most widely used model in statistics. To see whether data follows it, we group similar values together and count how many times each group appears: in normally distributed data, most values cluster around the mean and the counts fall off symmetrically on both sides, which gives the familiar bell shape.
Fitting a normal distribution shows at a glance how the values in a dataset are spread. This is useful when we want to know which values lie close to the mean and which lie far from it (outliers we might want to remove), and whether the data is predictable (low variance) or uncertain (high variance).
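As a rough illustration of spotting values that are "too far", here is a minimal sketch that measures each value's distance from the mean in standard deviations (its z-score). The sample data, the two planted outliers, and the cutoff of 3 standard deviations are assumptions made for this example, not rules from the article.

```python
import numpy as np

# Hypothetical sample: 1,000 values drawn from a normal distribution
# (mean 50, standard deviation 5), plus two values planted far away.
rng = np.random.default_rng(seed=0)
values = np.concatenate([rng.normal(loc=50, scale=5, size=1000), [95.0, 3.0]])

mean = values.mean()
sd = values.std()

# z-score: how many standard deviations each value lies from the mean.
z_scores = (values - mean) / sd

# Assumed rule of thumb: anything beyond 3 standard deviations is "too far".
is_outlier = np.abs(z_scores) > 3
cleaned = values[~is_outlier]

print(f"mean={mean:.2f}, sd={sd:.2f}, variance={sd**2:.2f}")
print(f"flagged {is_outlier.sum()} of {values.size} values as outliers")
```

A stricter or looser cutoff may suit other datasets better; 3 is only a common starting point.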
Where is it used?
Many real-world datasets can be modeled, at least approximately, with a normal distribution: think of stock data, sports scores, psychometric values… This makes it very popular, and because so much data fits this mathematical function reasonably well, we can reuse well-known, ready-made calculations on it.
For example, in stock trading or portfolio diversification, risk is assessed using the standard deviation: a stock whose returns form a wide normal distribution is considered risky (because it can fluctuate strongly), while a stock whose returns form a narrow one is considered safer.
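To make the risk comparison concrete, here is a small sketch with two simulated return series. Both stocks are made up, and the means and standard deviations are illustrative assumptions rather than real market figures.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical daily returns for two made-up stocks with the same average
# return (0.05%) but different spreads. The numbers are illustrative only.
steady_returns = rng.normal(loc=0.0005, scale=0.005, size=10_000)
volatile_returns = rng.normal(loc=0.0005, scale=0.03, size=10_000)

for name, returns in [("steady", steady_returns), ("volatile", volatile_returns)]:
    # The sample standard deviation is the risk measure described above.
    print(f"{name:>8}: mean={returns.mean():.4f}, sd={returns.std(ddof=1):.4f}")
```

The two series have nearly the same average return, so the standard deviation is what separates the "safer" stock from the "riskier" one.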
What do we need to replicate it?
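Adapting a normal distribution to real data is simple, because we only have 3 numbers to play with: mean, standard deviation, and alpha (called alfa in the code). Strictly speaking, a normal distribution has only the first two; alpha is the skewness parameter of the skew-normal distribution, which reduces to the normal distribution when alpha is 0.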
- The mean allows the distribution to move left (lower) or right (higher)
- The standard deviation controls the spread of the distribution (the higher, the wider)
- Alpha skews the distribution: a negative value stretches the tail to the left, a positive value stretches it to the right, and 0 gives the symmetric normal curve
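The effect of the three parameters above can be checked numerically. The sketch below assumes SciPy's `skewnorm`, where the shape parameter `a` plays the role of alpha, `loc` the role of the mean, and `scale` the role of the standard deviation. The helper name `sample` and the chosen values are hypothetical.

```python
from scipy.stats import skew, skewnorm


def sample(mean, sd, alpha, size=200_000, seed=0):
    # In SciPy's terms: alpha is the shape parameter `a`, mean is `loc`,
    # and sd is `scale`. When alpha != 0, loc and scale are no longer
    # exactly the mean and standard deviation of the samples.
    return skewnorm.rvs(alpha, loc=mean, scale=sd, size=size, random_state=seed)


experiments = {
    "base (0, 1, 0)": sample(mean=0, sd=1, alpha=0),
    "higher mean": sample(mean=3, sd=1, alpha=0),
    "higher sd": sample(mean=0, sd=2, alpha=0),
    "alpha = -5": sample(mean=0, sd=1, alpha=-5),
    "alpha = +5": sample(mean=0, sd=1, alpha=5),
}

for name, data in experiments.items():
    print(f"{name:>15}: mean={data.mean():.2f}, sd={data.std():.2f}, "
          f"skewness={skew(data):.2f}")
```

A negative sample skewness means a longer left tail, a positive one a longer right tail.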
Coding the distribution
Coding the distribution means generating a final dataset of thousands or millions of samples that look like a normal distribution when graphed. Probabilistically speaking, this is done by drawing random samples from a function with our own parameters (to keep the code simple for beginners it is a single small function, which you can easily extend):
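    import pandas as pd
    from scipy.stats import skewnorm


    def create_pdf(sd, mean, alfa):
        # Draw 1,000,000 samples from a standard skew-normal (loc=0, scale=1).
        # Invert the sign of alfa to mirror the direction of the skew.
        x = skewnorm.rvs(alfa, size=1000000)

        def calc(k, sd, mean):
            # Rescale the standard samples by sd and shift them by mean.
            return (k * sd) + mean

        x = calc(x, sd, mean)
        return x


    x = create_pdf(sd=0.1, mean=1, alfa=5)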
Graphing the normal distribution
Once we have created a dataset of 1,000,000 points randomly drawn from the distribution, we can use the pandas plotting API to show a histogram of it.
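A minimal sketch of such a plot follows. It assumes matplotlib is installed (pandas uses it for plotting). The samples are regenerated with the same parameters as above, and the fixed `random_state`, the bin count, and the title are assumptions added for this example.

```python
import pandas as pd
from scipy.stats import skewnorm

# Same parameters as the example above (sd=0.1, mean=1, alfa=5); the fixed
# random_state is an added assumption so that the figure is reproducible.
x = skewnorm.rvs(5, loc=1, scale=0.1, size=1_000_000, random_state=0)

# Series.plot.hist draws the histogram through matplotlib and returns the
# Axes object; 100 bins is an arbitrary choice.
ax = pd.Series(x).plot.hist(bins=100, title="Histogram of the samples (alfa=5)")
ax.set_xlabel("value")

# In a notebook the figure appears automatically. In a script, call
# ax.figure.savefig("some_file.png") or matplotlib.pyplot.show() to see it.
```

More bins give a smoother outline at the cost of noisier individual bars.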
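If we wish to skew the distribution to the left, we can change the alpha parameter to -5. With SciPy's skewnorm, a positive value such as 5 gives a longer right tail, and a negative value mirrors it.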
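To compare the two directions side by side, the sketch below overlays both histograms and prints the sample skewness of each. It assumes matplotlib is installed, and the column labels, bin count, and `random_state` are choices made for this example.

```python
import pandas as pd
from scipy.stats import skew, skewnorm

# Same location and scale as before; only the sign of the shape parameter changes.
right = skewnorm.rvs(5, loc=1, scale=0.1, size=1_000_000, random_state=0)
left = skewnorm.rvs(-5, loc=1, scale=0.1, size=1_000_000, random_state=0)

print(f"alfa=5  -> sample skewness {skew(right):.2f}")
print(f"alfa=-5 -> sample skewness {skew(left):.2f}")

samples = pd.DataFrame({"alfa=5": right, "alfa=-5": left})

# Note: the `alpha` keyword here is matplotlib's transparency setting,
# unrelated to the skewness parameter of the distribution.
ax = samples.plot.hist(bins=100, alpha=0.5, title="Right-skewed vs left-skewed")
```

A positive skewness goes with the longer right tail and a negative one with the longer left tail.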