Let’s pretend that we ran a survey and recorded the height of everyone who responded. It might be useful to plot those heights to get an idea of which height groups are in our sample. How should we actually plot these? Off the top of your head, you might think that a bar chart would be a good idea, but if you think about it, we could have up to a hundred different heights, maybe even more. If we plotted how many responses we got for each individual height, we would end up with almost a hundred different columns, which definitely isn’t useful. This is where histograms come in.
Histograms are great for visualizing the distribution of data, that is, how many values fall within certain boundaries. A histogram looks like a bar graph, but it groups the data into bins instead of plotting each individual value. The best way to see what this looks like is to take a look at some examples.
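To make the idea of bins concrete before we touch the real data, here is a minimal sketch that counts values per bin with NumPy’s `np.histogram`. The heights in the list are made up for illustration and are not taken from the survey file.

```python
import numpy as np

# Hypothetical heights in centimeters (made-up values, for illustration only).
heights = [150, 152, 158, 161, 163, 165, 170, 171, 178, 189]

# Ask for 4 equal-width bins between the smallest and largest height.
counts, edges = np.histogram(heights, bins=4)

# Each bin is described by its left and right edge and the number of values in it.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.2f} to {right:.2f} cm: {count} people")
```

Instead of ten separate bars, one per height, we get four bars that each summarize a range of heights.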
In this tutorial, we’re going to learn how to plot more data than just a small list. We’ll look at a real-world example with data loaded from a CSV file, and we’ll also learn how to draw overlapping histograms using Pandas and Matplotlib. Before we set up the chart, let’s look at what the end result should be.
This is what we want to achieve. The data is split by gender and contains the heights, in centimeters, of 245 males and 255 females. The chart in the center shows the distribution of heights for each gender. The dark brown color marks the region where the two distributions overlap, so the light brown plus the dark brown shows the distribution of females, and the blue plus the dark brown shows the distribution of males. We’d like to be able to draw this, but unfortunately Pandas and Matplotlib do not produce it by default. So let’s see how to create this very diagram.
Here is the data that I’m going to be using for this tutorial. Right now it is just a list of the heights of males and females.
That’s basically what histogram plots are used for: we drop our data into different bins and see how many values fall into each one.
First, a quick look at the categories in our binary field, using ‘value_counts()’ to see the count of each unique category:
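import pandas as pd
import matplotlib.pyplot as plt

# Load the survey data and drop rows with missing values
csv_file = '/content/500_Person_Gender_Height_Weight_Index.csv'
df = pd.read_csv(csv_file)
df = df.dropna()  # dropna() returns a new DataFrame, so keep the result

# Count how many rows fall into each gender category
df['Gender'].value_counts()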
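If you don’t have the CSV file at hand, the same two steps can be tried on a tiny stand-in DataFrame. This is only a sketch: the rows below are made up, and only the column names ‘Gender’ and ‘Height’ are borrowed from the tutorial’s data.

```python
import pandas as pd

# Hypothetical stand-in for the CSV file (made-up rows, same column names).
sample = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Female"],
    "Height": [174, 161, None, 189, 158],
})

# dropna() returns a new DataFrame and leaves the original unchanged,
# so the result has to be assigned to a name.
cleaned = sample.dropna()

# Count the rows in each category after the incomplete row was removed.
gender_counts = cleaned["Gender"].value_counts()
print(gender_counts)
```

The row with the missing height is gone from `cleaned` but still present in `sample`, which is why the result of `dropna()` needs to be kept.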
To compare a scale variable across two categories, one option is to overlay two histograms on each other. This example uses ‘Gender’ as the binary field and ‘Height’ as the scale field. We need to separate the scores for each category, and we can do that by creating a boolean (True/False) Series for each category.
# Boolean masks: True where the row belongs to the category
male = df['Gender'] == 'Male'
female = df['Gender'] == 'Female'
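To see what such a boolean mask contains and how it selects rows, here is a small sketch on made-up data. The DataFrame `sample` and the variable names are illustrative assumptions; only the column names match the tutorial’s file.

```python
import pandas as pd

# Hypothetical data (made-up rows) with the same column names as the CSV.
sample = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Height": [174, 161, 158, 189],
})

# Comparing a column with a value gives one True/False entry per row.
is_male = sample["Gender"] == "Male"
is_female = sample["Gender"] == "Female"

# Indexing with a mask keeps only the rows where the mask is True.
male_heights = sample.loc[is_male, "Height"]
female_heights = sample.loc[is_female, "Height"]

print(is_male.tolist())
print(male_heights.tolist())
print(female_heights.tolist())
```

Writing `sample.loc[is_male, "Height"]` is equivalent in effect to `sample["Height"][is_male]`; it just does the row and column selection in one step.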
We can use these masks to select the scores of each category and store them separately. Then we’ll create a separate histogram for each category using the ‘hist’ function from Pyplot.
plt.style.use('ggplot')
plt.title('Height of Male and Female')

# One histogram per category, drawn on the same axes
plt.hist(df['Height'][male], edgecolor='black', color='blue', bins=10, label='Male')
plt.hist(df['Height'][female], edgecolor='black', color='brown', bins=10, label='Female')

plt.legend(loc='upper right')
plt.xlabel('Height')
plt.ylabel('Number of People')
plt.tight_layout()
plt.show()
You can also set the bins explicitly. When we specify bins, we can pass in either an integer or a list of values. If we pass in an integer, Matplotlib makes that number of equal-width bins and divides our data among them. For example, bins=5 divides all of the heights into 5 bins and tells us how many people fell into each height range.
If we pass in our own list of values instead, those values become the bin edges, which gives us more control over exactly where each bin starts and ends.
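As a sketch of the second form, the example below passes a list of bin edges to `plt.hist`. The heights and the edges are made-up values chosen for illustration, and the `Agg` backend line is only there so the snippet runs without a display; in a notebook you can leave it out.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for running without a display
import matplotlib.pyplot as plt

# Hypothetical heights in centimeters (made-up values).
heights = [150, 152, 158, 161, 163, 165, 170, 171, 178, 189]

# Six edges define five bins: 140-150, 150-160, 160-170, 170-180, 180-190.
bin_edges = [140, 150, 160, 170, 180, 190]

# hist returns the count per bin, the edges it used, and the drawn bars.
counts, edges, _ = plt.hist(heights, bins=bin_edges, edgecolor='black')
print(counts)
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both. That is why a height of exactly 150 is counted in the 150-160 bin rather than in 140-150.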
We’re almost there with our overlapping histogram, but there is a problem: one set of bars is drawn on top of the other, so wherever they overlap we can’t see the data hidden behind. They’re not really overlapping yet; one is just covering the other. To fix this, we need to change the opacity of the bars.
To make the bin values clearer, let’s set the alpha value, which controls the transparency level of the bars.
# alpha=0.5 makes each set of bars half transparent so both stay visible
plt.hist(df['Height'][male], edgecolor='black', color='blue', rwidth=0.9, alpha=0.5, label='Male')
plt.hist(df['Height'][female], edgecolor='black', color='brown', rwidth=0.7, alpha=0.5, label='Female')
Now we can see the bin boundaries more clearly.
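One optional refinement, which is my own suggestion rather than part of the steps above: when each `hist` call picks its own bins, the bars of the two groups may not line up. Computing a single set of edges from the combined data and passing it to both calls keeps the bars aligned. The sketch below uses made-up heights, and the `Agg` backend line is only there so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for running without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical heights in centimeters (made-up values).
male_heights = [165, 170, 172, 175, 180, 185, 190]
female_heights = [150, 155, 160, 162, 165, 170, 175]

# Compute one set of bin edges from all heights so both groups share the same bins.
shared_edges = np.histogram_bin_edges(male_heights + female_heights, bins=8)

fig, ax = plt.subplots()
male_counts, _, _ = ax.hist(male_heights, bins=shared_edges, alpha=0.5,
                            color='blue', edgecolor='black', label='Male')
female_counts, _, _ = ax.hist(female_heights, bins=shared_edges, alpha=0.5,
                              color='brown', edgecolor='black', label='Female')

ax.legend(loc='upper right')
ax.set_xlabel('Height')
ax.set_ylabel('Number of People')
```

With shared edges, a bar for males and a bar for females always cover the same height range, so the blended color in the overlap can be read directly as the part both groups have in common.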