What is Box Plot?
I ran into one of my college friends yesterday in a nearby lounge. We were meeting after many years and had a lot to catch up on. Eventually, our discussion turned towards the future.
"Hey, do you remember I always wished to pursue higher studies? I think this is the right time. I am planning to pursue my Masters - MBA (Master of Business Administration)."
"Oh wow! Congratulations! So when are you starting, and which university have you joined?"
"I have not yet finalized any university. I am willing to join any university, though."
"Which universities did you shortlist?"
"I have selected a few. I am finding it difficult to choose. Can you help me decide?"
"Oh sure! Share the information that you have collected!"
The next day my friend shared the data of all the universities from which he wished to pursue his post-graduation program.
I did my part of the research and collected a few more details about those Universities like:
- What sort of examination does one need to clear to apply?
- What are the minimum marks that one would need to score to secure admission in a University?
- What are the acceptance criteria of the University?
- What is the ratio of students to faculty?
- Are there any scholarship programs? If so, what is the maximum scholarship allowed?
- What would be the minimum expenses?
- What would be the salary package for graduates of the University? etc.
A statement on one of the University websites caught my attention: “Average Salary of students of the University is $132000 per annum”.
I had accumulated and arranged enough data to start my analysis, and it gave me some interesting insights to help my friend. I applied statistical computations to the arranged data; this process is called Exploratory Data Analysis (EDA):
- First Moment Business Decision or Measure of Central Tendency
  - Mean - Average of all the data in a column/feature
  - Median - Middle value of the sorted data in a column/feature
  - Mode - Most frequently occurring value; the usual choice when the data is categorical
- Second Moment Business Decision or Measure of Dispersion
  - Variance - Average squared deviation from the mean; it shows how dispersed the data is, but in squared units
  - Standard Deviation - Square root of the variance; it overcomes the squared-units problem of variance by expressing dispersion in the original units of the data
  - Range - Difference between the maximum and the minimum value in the data
- Third Moment Business Decision or Skewness
- Fourth Moment Business Decision or Kurtosis
- Graphical Representation
  - Univariate - Requires one variable/feature to get a plot
    - Bar plot
    - Box plot
  - Bivariate - Requires two variables to get a plot
    - Scatter Plot
  - Multivariate - Requires many variables to get a plot
    - Conditional plot
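The numerical summaries in the list above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `describe` and the sample values are hypothetical, and the skewness and kurtosis are computed in their simple population form (third and fourth standardized moments), which may differ slightly from what a given statistics package reports.

```python
import statistics

def describe(values):
    """Return the first four 'moment' summaries for a list of numbers."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    # Standardized third and fourth moments (population form).
    skewness = sum(((v - mean) / sd) ** 3 for v in values) / n
    kurtosis = sum(((v - mean) / sd) ** 4 for v in values) / n
    return {
        "mean": mean,
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "variance": statistics.pvariance(values),
        "std_dev": sd,
        "range": max(values) - min(values),
        "skewness": skewness,
        "kurtosis": kurtosis,
    }

# Hypothetical values, not the masked university data.
summary = describe([2, 4, 4, 4, 5, 5, 7, 9])
for name, value in summary.items():
    print(f"{name}: {value}")
```

A positive skewness, as in this small example, indicates a longer tail on the right, which is the typical shape when a few very high values are present.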
In this article, I will explain one of the most helpful univariate graphical representations: the Box Plot.
What is a Box Plot?
A box plot is a graphical representation of how the values in the data are spread out. It represents 100% of the data, including outliers. It is useful in the instances where a histogram or another graph does not give the desired information about the distribution of the data.
What is the information provided by a Box Plot?
Box Plot will provide the following information
- Median (Q2/ 50th Percentile): The Middlemost value
- First Quartile (Q1/25th Percentile): The middle number between the smallest number and median of the data
- Third Quartile (Q3/75th Percentile): The middle number between the highest number and median of the data
- Whisker: A line extending from each end of the box, which is why the box plot is also called a “Box and Whisker Plot”. Each whisker (lower and upper) spans roughly 25% of the data
- Outliers: Any data point that lies beyond the whiskers is plotted individually as a dot
- Inter Quartile Range (IQR): The difference between the 75th and 25th percentiles, i.e. IQR = Q3 – Q1
- “Minimum”: (Q1 - 1.5 * IQR), the lower limit for the whisker; points below it are treated as outliers
- “Maximum”: (Q3 + 1.5 * IQR), the upper limit for the whisker; points above it are treated as outliers
Box Plot on Normal Distribution:
Let us try and understand the Box Plot on a normal distribution and the probability density function.
[Figure: Box Plot on Normal Distribution]
What is a normal distribution?
The normal distribution is commonly summarized by the 68 - 95 - 99.7 rule.
Points to be considered here are:
- About 68% of the data lies within 1 standard deviation (σ) of the mean (μ)
- About 95% of the data lies within 2 standard deviations (σ) of the mean (μ)
- About 99.7% of the data lies within 3 standard deviations (σ) of the mean (μ)
- About 0.7% of the data falls beyond the whisker limits (Q1 - 1.5 * IQR and Q3 + 1.5 * IQR) and would be plotted as outliers
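These percentages can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library on a standard normal distribution (mean 0, standard deviation 1), so every value it prints is in units of σ; the variable names are only illustrative.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Share of the data within k standard deviations of the mean.
for k in (1, 2, 3):
    share = z.cdf(k) - z.cdf(-k)
    print(f"within {k} sigma: {share:.4f}")

# Quartiles and whisker limits, in units of sigma.
q1, q3 = z.inv_cdf(0.25), z.inv_cdf(0.75)
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Share of the data beyond the whisker limits (both tails together).
outlier_share = z.cdf(lower_limit) + (1 - z.cdf(upper_limit))
print(f"IQR: {iqr:.3f} sigma")
print(f"whisker limits: {lower_limit:.3f} and {upper_limit:.3f} sigma")
print(f"outside the whiskers: {outlier_share:.4f}")
```

For normally distributed data the whisker limits sit at roughly ±2.7σ, so only about 0.7% of the points are expected to fall outside them.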
Note: Boxplot can be constructed on non-normal data also
Let us now understand more about Boxplot
The basic format of the box plot uses a box to convey the middle 50% of the data. The width of this region is called the Inter Quartile Range (IQR).
Several variations on the traditional “Box Plot” exist. The most common among them are Variable Width Box Plot and Notched Box Plot.
- Variable Width Box Plot: It illustrates the size of each group of data by making the width proportional to the size of the group
- Notched Box Plot: It applies a “notch”, a narrowing of the box around the median. The width of the notch is proportional to the IQR of the sample and inversely proportional to the square root of the size of the sample
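The proportionality described for the notch can be written as median ± k × IQR / √n. The article does not state the constant k; the sketch below assumes k = 1.57, a commonly used convention, and the function name `notch_limits` is hypothetical.

```python
import math

def notch_limits(median, iqr, n, k=1.57):
    """Return (lower, upper) notch limits around the median.

    k = 1.57 is an assumed convention, not a value given in this article.
    """
    half_width = k * iqr / math.sqrt(n)
    return median - half_width, median + half_width

# Hypothetical group: median 15, IQR 5, 10 observations.
lower, upper = notch_limits(median=15, iqr=5, n=10)
print(f"notch from {lower:.2f} to {upper:.2f}")
```

Doubling the IQR doubles the notch width, while quadrupling the sample size halves it, which is why notches on large groups are narrow.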
Math Behind Box Plot:
Consider a sample of 10 data points
11, 16, 12, 17, 14, 12, 16, 17, 13, 20
- Order the data from smallest to largest
11, 12, 12, 13, 14, 16, 16, 17, 17, 20
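- Find the Median
The median is the middlemost value for an odd number of data points, or the average of the two middle values for an even number of data points. Here there are 10 data points, so the median is the average of the 5th and 6th values:
Median = (14 + 16) / 2 = 15
- Find the quartiles
The first quartile is the median of the data points to the left of the median, i.e. 11, 12, 12, 13, 14
Q1 = 12
The third quartile is the median of the data points to the right of the median, i.e. 16, 16, 17, 17, 20
Q3 = 17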
- Complete the five-number summary by finding the minimum and maximum value in the dataset
- The minimum is the smallest data point, which is 11
- The maximum is the largest data point, which is 20
- The five-number summary is 11, 12, 15, 17, 20
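The steps above can be reproduced with a short Python sketch. It follows the same "median of each half" rule used in the worked example; the function name `five_number_summary` is hypothetical, and for an odd number of data points it assumes the median itself is left out of both halves, which is one of several conventions in use.

```python
import statistics

def five_number_summary(values):
    """Five-number summary using the median-of-halves rule from the text."""
    data = sorted(values)
    half = len(data) // 2
    lower_half = data[:half]              # points to the left of the median
    upper_half = data[len(data) - half:]  # points to the right of the median
    return (
        data[0],                          # minimum
        statistics.median(lower_half),    # Q1
        statistics.median(data),          # median (Q2)
        statistics.median(upper_half),    # Q3
        data[-1],                         # maximum
    )

sample = [11, 16, 12, 17, 14, 12, 16, 17, 13, 20]
print(five_number_summary(sample))  # (11, 12, 15.0, 17, 20)
```

Statistical packages often interpolate quartiles differently, so their Q1 and Q3 may not match this hand calculation exactly on small samples.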
Construction of a Box Plot Using the Above Data
- Mark an axis that fits the above five-number summary
[Figure: box plot axis]
- Draw a box from Q1 to Q3 with a vertical line through the median
[Figure: box plot demarcating median and quartiles]
- Draw a whisker from Q1 to the min and from Q3 to the max
Min = 11 and Max = 20
[Figure: box plot demarcating whiskers]
Interpreting the Quartiles
The five-number summary divides the data into four sections, each containing about 25% of the data in the set.
Since Q1 = 12, about 25% of the data is lower than 12 and about 75% is above 12.
If the data happens to be normally distributed, then IQR ≈ 1.35 σ, where σ is the standard deviation.
Outliers are data points that lie more than 1.5 * IQR above the third quartile or below the first quartile.
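As a small illustration of the 1.5 * IQR rule, the sketch below applies it to the ten sample points, using Q1 = 12 and Q3 = 17 from the worked example. The function name `find_outliers` and the extra value of 30 are hypothetical, added only to show a point being flagged.

```python
def find_outliers(values, q1, q3):
    """Return the points lying more than 1.5 * IQR beyond the quartiles."""
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    return [v for v in values if v < lower_limit or v > upper_limit]

sample = [11, 12, 12, 13, 14, 16, 16, 17, 17, 20]

# With Q1 = 12 and Q3 = 17 the limits are 4.5 and 24.5,
# so none of the ten sample points is an outlier.
print(find_outliers(sample, q1=12, q3=17))

# A hypothetical extra point of 30 lies above 24.5 and is flagged
# (the quartiles are held at 12 and 17 for this illustration).
print(find_outliers(sample + [30], q1=12, q3=17))
```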
Note that outliers are not necessarily “bad”; they may be the most important and most informative part of the dataset. They should not be removed without proper verification. Outliers require special treatment: they may be the key to the study, or they may be the result of human error.
With the help of the Box Plot, I tried to derive insights for 10 different Universities. However, I will explain using one feature (Salaries) from the University data. (Please note that for this article I have masked the data points, as they contain some sensitive information.)
I have used Data Science tools like R and Python to come up with these insights.
- Load the required packages
- “readxl”: The package used to read an Excel file
- “read_excel()”: The function to read an Excel file
- “file.choose()”: Passed as the argument to read_excel() to pick the dataset through a file dialog (GUI)
- “attach()”: Attaches the data frame to R's search path, so that columns can be called directly by name without referring to the table name
- “names()”: Function to show the column names
I have created a Box Plot to identify some of the outliers in the data. If I remove them the average salary will get affected.
[Figure: box plot code in R language]
The output of the box plot looks like the image below, which shows that there are some outliers influencing the mean calculation.
[Figure: box plot content]
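To see how strongly a few extreme values can pull the mean, here is a minimal Python sketch. The salary figures are hypothetical and are not the masked university data; they are chosen only to show the effect on the mean and the median.

```python
import statistics

# Hypothetical salaries (in dollars), not the masked university data.
typical = [62000, 65000, 68000, 70000, 72000, 75000, 78000, 80000]
extreme = [400000, 450000]  # a couple of unusually high packages

with_outliers = typical + extreme

print("mean without outliers:  ", statistics.fmean(typical))
print("mean with outliers:     ", statistics.fmean(with_outliers))
print("median without outliers:", statistics.median(typical))
print("median with outliers:   ", statistics.median(with_outliers))
```

In this made-up example two extreme salaries move the mean from 71,250 to 142,000, while the median only moves from 71,000 to 73,500. This is why an advertised "average salary" deserves a look at the outliers behind it.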
Calculation of IQR: I need Q1 and Q3, for which the inbuilt function quantile() is used to calculate the 0.25 and 0.75 quantiles (the 25th and 75th percentiles) respectively.
[Figure: using the quantile function]
Outliers are the values below (Q1 - 1.5 * IQR) or above (Q3 + 1.5 * IQR).
Calculation of outliers:
[Figure: code for outliers]
Now I have the outlier data, which I am appending to the original file.
[Figure: appending outliers data]
I will share this final analysis report loaded into the file Outliers.csv with my friend so that he can now make a wise decision of choosing the right university.
I can achieve the same using Python Programming as well
- Import the required libraries and import the dataset.
[Figure: importing Python libraries]
[Figure: importing Python datasets]
- In Python, I used the seaborn library for the boxplot function
[Figure: Python code for box plot]
I used swarmplot() to get a better representation of the distribution of the data. However, if the data is large, this representation would not be ideal.
[Figure: swarmplot function]
The output shows the distribution of data points along with the boxplot.
[Figure: box plot output]
We can see that there are some outliers; however, we need to know what those outliers are.
In order to calculate outliers mathematically, we need the IQR (Inter Quartile Range), which is IQR = Q3 (Quartile 3) – Q1 (Quartile 1), i.e. the difference between the 75th and 25th percentiles.
- In Python we have a function quantile() to calculate percentiles; the IQR can be calculated from Q1 and Q3
[Figure: importing Python libraries]
With the IQR, I flag outliers as the values below (Q1 – 1.5 * IQR) or above (Q3 + 1.5 * IQR). The Python code returns a Boolean for each value, printing True for an outlier and False otherwise.
[Figure: code for outliers]
[Figure: salaries of university students after box plot analysis]
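Since the original code images are not reproduced here, the following is only a sketch of the same idea: computing Q1, Q3 and the IQR, then producing a True/False flag per salary. It uses the standard library's `statistics.quantiles` rather than the quantile() call mentioned above, and the salary list is hypothetical.

```python
import statistics

# Hypothetical salaries (in dollars), not the masked university data.
salaries = [62000, 65000, 68000, 70000, 72000, 75000, 78000, 80000, 400000]

# method="inclusive" interpolates linearly between data points.
q1, _, q3 = statistics.quantiles(salaries, n=4, method="inclusive")
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# One True/False flag per salary: True marks an outlier.
is_outlier = [s < lower_limit or s > upper_limit for s in salaries]

for salary, flag in zip(salaries, is_outlier):
    print(salary, flag)
```

Here only the 400,000 salary is flagged as True; the flags can then be stored alongside the data so that the outliers can be reviewed rather than silently dropped.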
Conclusion: Data Science is all about communicating results to people who may not be aware of these hidden insights. From the analysis, I was able to see things that were not clear otherwise. Hopefully, the results from the statistical calculations will help my friend to choose wisely.
I would inform my friend not to fall prey to the extreme salary figures that are being quoted by a few Universities.