There are many ways to show data with graphs, but the histogram is special. It looks similar to a bar graph and is a fast, easy way to summarize data. You can use these powerful little charts to gauge your data’s spread, variability, central tendency, and more. No matter how large your data set is, if you draw a histogram, you’ll be able to “see” what’s happening inside of it.
With histograms, the areas of the bars don’t just measure the count or frequency of the thing being measured; they also show the percentage of the entire data set represented by each segment.
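To make the link between bar area and percentage concrete, here is a minimal sketch using NumPy. The prices are synthetic (randomly generated for illustration, not the Kaggle data), and the variable names are arbitrary.

```python
import numpy as np

# Synthetic prices for illustration only (not the Kaggle data).
rng = np.random.default_rng(0)
prices = rng.normal(loc=13000, scale=4000, size=205)

# Raw counts per bin and the bin edges.
counts, edges = np.histogram(prices, bins=10)

# Share of the whole data set that falls in each bin, as a percentage.
percent = 100 * counts / counts.sum()

# With density=True the bar heights are rescaled so that the
# total area (height * bin width) equals 1.
density, _ = np.histogram(prices, bins=10, density=True)
total_area = np.sum(density * np.diff(edges))

for left, right, pct in zip(edges[:-1], edges[1:], percent):
    print(f"{left:9.0f} to {right:9.0f}: {pct:5.1f}%")
print("total area:", round(total_area, 6))
```

Passing `density=True` to `plt.hist` applies the same rescaling to the plotted bars, so the y-axis shows density instead of counts.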
In this tutorial, we use a car price prediction dataset from Kaggle. It has 205 rows and 26 features.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Load the car price data set
df = pd.read_csv('/content/CarPrice_Assignment.csv')
# Plot a histogram of price using 10 bins
plt.style.use('ggplot')
plt.title('Car Price')
plt.hist(df['price'], edgecolor='black', bins=10, label='Car')
plt.legend(loc='upper right')
plt.xlabel('Price')
plt.ylabel('Number of Cars')
plt.tight_layout()
plt.show()
The histogram does a good job of visually conveying the mean, median, and standard deviation. You can’t read the exact figures off it, but you can get a sense of those numbers from the shape of the distribution.
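If you want the exact figures next to the picture, one option is to compute them and mark them on the histogram. The sketch below uses synthetic, right-skewed prices rather than the Kaggle data, and the colors and labels are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic, right-skewed prices for illustration only (not the Kaggle data).
rng = np.random.default_rng(1)
prices = rng.lognormal(mean=9.4, sigma=0.4, size=205)

mean = prices.mean()
median = np.median(prices)
std = prices.std(ddof=1)  # sample standard deviation

fig, ax = plt.subplots()
ax.hist(prices, bins=10, edgecolor='black')
# Vertical lines mark the two measures of central tendency.
ax.axvline(mean, color='blue', linestyle='--', label=f'mean = {mean:.0f}')
ax.axvline(median, color='green', linestyle='-', label=f'median = {median:.0f}')
# Shade one standard deviation on either side of the mean.
ax.axvspan(mean - std, mean + std, color='blue', alpha=0.1, label='mean +/- 1 std')
ax.set_xlabel('Price')
ax.set_ylabel('Number of Cars')
ax.legend(loc='upper right')
fig.tight_layout()
plt.show()
```

To try this on the real data, replace `prices` with `df['price']` from the code above.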
The distribution in the histogram you’ve been evaluating is definitely not normal. As long as there’s more than one hump, you can’t call the distribution bell-shaped, and that shape must have some sort of meaning. The question is, why is the distribution shaped that way? How will you find out?
Group by Histogram
There is too much data to read and understand at once. Until you’ve grouped the data, you don’t really know what’s in it. Start by breaking the data down into groups so you can summarize it better.
Once you have those groups, then you can look at whatever other group statistics you consider useful. Much of the analysis consists of taking information and breaking it down into smaller, more manageable pieces.
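As a rough illustration of group statistics, the sketch below summarizes price per drive-wheel group with pandas. The small data frame `df_demo` is a made-up stand-in for the car data; only the column names `drivewheel` and `price` come from the tutorial.

```python
import numpy as np
import pandas as pd

# Small synthetic stand-in for the car data (all values are made up).
rng = np.random.default_rng(2)
df_demo = pd.DataFrame({
    'drivewheel': ['fwd'] * 6 + ['rwd'] * 4 + ['4wd'] * 2,
    'price': np.concatenate([
        rng.normal(9000, 1500, 6),
        rng.normal(20000, 4000, 4),
        rng.normal(11000, 2000, 2),
    ]).round(0),
})

# One row of summary statistics per drive-wheel group.
summary = df_demo.groupby('drivewheel')['price'].agg(['count', 'mean', 'median', 'std'])
print(summary)
```

Running the same `groupby(...).agg(...)` call on `df` gives the corresponding table for the real data set.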
Let’s make a bunch of histograms that describe subsets of the price data. Maybe looking at these other histograms will help you figure out what the two humps on the price histogram mean. Is there a group of cars whose prices are higher than the rest?
A visualization of the number of cars that fall in each category of prices will enable you to see the whole data set at once.
Inside your data are subsets that represent different groups. If you plot the price values for each subset, you might get a bunch of different shapes.
You can make a histogram out of your entire data set, but you can also split up the data into subsets to make other histograms.
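One way to draw a separate histogram for each subset is to loop over the groups and give each its own panel. This is a sketch under assumptions: `df_demo` is synthetic, the group sizes are arbitrary, and the figure size and layout are just one reasonable choice.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the car data (values and group sizes are made up).
rng = np.random.default_rng(3)
df_demo = pd.DataFrame({
    'drivewheel': ['fwd'] * 100 + ['rwd'] * 80 + ['4wd'] * 25,
    'price': np.concatenate([
        rng.normal(9000, 1500, 100),
        rng.normal(20000, 4000, 80),
        rng.normal(11000, 2000, 25),
    ]),
})

groups = df_demo.groupby('drivewheel')

# One panel per group, sharing the x-axis so the shapes can be compared.
fig, axes = plt.subplots(1, groups.ngroups, figsize=(12, 3), sharex=True)
for ax, (name, subset) in zip(axes, groups):
    ax.hist(subset['price'], bins=10, edgecolor='black')
    ax.set_title(f'{name} (n = {len(subset)})')
    ax.set_xlabel('Price')
axes[0].set_ylabel('Number of Cars')
fig.tight_layout()
plt.show()
```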
Overlay Group by Histograms
To make a good comparison, the histograms need to describe the same thing. The code below builds several histograms from subsets of the same data, grouped by ‘drivewheel’, so comparing those subsets to each other makes sense.
# Split the rows by drive-wheel type
car_df = df.groupby('drivewheel')
print(list(car_df.groups))  # list the group names
# Overlay one semi-transparent price histogram per group
plt.style.use('ggplot')
plt.title('Car price group by drivewheel')
plt.hist(car_df.get_group('fwd')['price'], edgecolor='black', color='yellow', rwidth=0.9, alpha=0.5, label='fwd')
plt.hist(car_df.get_group('rwd')['price'], edgecolor='black', color='brown', rwidth=0.7, alpha=0.6, label='rwd')
plt.hist(car_df.get_group('4wd')['price'], edgecolor='black', color='green', rwidth=0.7, alpha=0.9, label='4wd')
plt.legend(loc='upper right')
plt.xlabel('Price')
plt.ylabel('Number of Cars')
plt.tight_layout()
plt.show()
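In the overlay above, each `plt.hist` call picks its own bins from the range of its own subset, so bars from different groups cover different price intervals. A possible refinement, sketched here on synthetic data (`df_demo` is made up, and the variable names are arbitrary), is to compute the bin edges once from all prices and reuse them for every group.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the car data (values and group sizes are made up).
rng = np.random.default_rng(4)
df_demo = pd.DataFrame({
    'drivewheel': ['fwd'] * 100 + ['rwd'] * 80 + ['4wd'] * 25,
    'price': np.concatenate([
        rng.normal(9000, 1500, 100),
        rng.normal(20000, 4000, 80),
        rng.normal(11000, 2000, 25),
    ]),
})

# Compute the bin edges once from all prices so every group shares them.
bin_edges = np.histogram_bin_edges(df_demo['price'], bins=10)

fig, ax = plt.subplots()
for name, subset in df_demo.groupby('drivewheel'):
    ax.hist(subset['price'], bins=bin_edges, alpha=0.5,
            edgecolor='black', label=name)
ax.set_title('Car price by drivewheel (shared bins)')
ax.set_xlabel('Price')
ax.set_ylabel('Number of Cars')
ax.legend(loc='upper right')
fig.tight_layout()
plt.show()
```

With shared edges, a bar at a given position means the same price interval for every group, which makes the overlaid counts directly comparable.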
There could be variation among drive wheels: for example, the price of ‘rwd’ cars could be much higher on average than the price of ‘fwd’ cars. There could be ‘carbody’ variation, too: sedans might fetch a higher price on average than hatchbacks, or vice versa. Of course, all the data is observational, so any relationships you discover won’t necessarily be as conclusive as what experimental data would show.