
Box Plot

  • by Mr. Nikhil Miryala
  • February 12, 2020

What is Box Plot?

Yesterday I ran into one of my college friends by chance in a nearby lounge. We had not met for many years and had a lot to catch up on. Eventually, our discussion turned to the future.

'Friend':
"Hey, do you remember that I always wished to pursue higher studies? I think this is the right time. I am planning to pursue my Masters - MBA (Master of Business Administration)."
'Me':
"Oh wow!!! Congratulations! So when are you starting, and which university have you joined?"
'Friend':
"I have not yet finalized any university. I am open to joining any of them, though."
'Me':
"Which universities did you shortlist?"

'Friend':
"I have selected a few. I am finding it difficult to choose. Can you help me decide?"

'Me':
"Oh sure! Share the information that you have collected!"
'Friend':
"Certainly!!"

The next day, my friend shared data on all the universities where he was considering his postgraduate program.

I did my own research and collected a few more details about those universities, such as:

  • What sort of examination does one need to clear to apply?
  • What are the minimum marks that one would need to score to secure admission in a University?
  • What are the acceptance criteria of the University?
  • What is the ratio of students to faculty?
  • Are there any scholarship programs? If so, what is the maximum scholarship allowed?
  • What would be the minimum expenses?
  • What would be the salary package for graduates of the University? etc.

I noticed a statement on one of the university websites that caught my attention: “Average Salary of students of the University is $132,000 per annum”.

I had gathered and arranged enough data to start my analysis, and it yielded some interesting insights to help my friend. I applied statistical computations to the arranged data, a process called Exploratory Data Analysis (EDA):

  • First Moment Business Decision or Measure of Central Tendency
    • Mean - Average of all the values in a column/feature
    • Median - Middle value of the sorted data in a column/feature
    • Mode - Most frequently occurring value; the usual choice when the data is categorical
  • Second Moment Business Decision or Measure of Dispersion
    • Variance - Average squared deviation from the mean; its units are the square of the data's units.
    • Standard Deviation - Square root of the variance, which brings the measure of dispersion back to the data's original units.
    • Range - Difference between the maximum and the minimum value in the data.
  • Third Moment Business Decision or Skewness
  • Fourth Moment Business Decision or Kurtosis
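
The four moments above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library and the ten-point sample from the worked example in this article; the helper name `standardized_moment` is hypothetical, and the population (not sample) formulas are an assumption made for simplicity.

```python
import statistics

# Ten-point sample from the worked example in this article
sample = [11, 16, 12, 17, 14, 12, 16, 17, 13, 20]

# First moment: central tendency
mean = statistics.mean(sample)
median = statistics.median(sample)
modes = sorted(statistics.multimode(sample))  # every value tied for most frequent

# Second moment: dispersion (population formulas, an assumption for this sketch)
variance = statistics.pvariance(sample)
std_dev = statistics.pstdev(sample)
data_range = max(sample) - min(sample)


def standardized_moment(values, order):
    """Average of ((x - mean) / std_dev) ** order over all values."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return sum(((v - mu) / sigma) ** order for v in values) / len(values)


# Third and fourth moments: skewness and (non-excess) kurtosis
skewness = standardized_moment(sample, 3)
kurtosis = standardized_moment(sample, 4)

print(mean, median, modes, variance, std_dev, data_range, skewness, kurtosis)
```

A positive skewness here indicates a longer tail towards the larger values, which is consistent with the single high value (20) in the sample.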

Graphical Representation

  • Univariate – Requires one variable/feature to get a plot
    • Histogram
    • Bar plot
    • Box plot
    • Q-Q plot
  • Bivariate – Requires two variables to get a plot
    • Scatter Plot
  • Multivariate – Requires many variables to get a plot
    • Conditional plot

In this article, I will explain one of the most helpful univariate graphical representations: the Box Plot.

What is a Box Plot?


A box plot is a graphical representation of how the values in the data are spread out. It represents 100% of the data, including outliers. It is particularly useful in cases where a histogram or another graph does not give the desired information about the distribution of the data.

 

What is the information provided by a Box Plot?

A Box Plot provides the following information:

  • Median (Q2/50th Percentile): The middlemost value of the data
  • First Quartile (Q1/25th Percentile): The middle number between the smallest number and the median of the data
  • Third Quartile (Q3/75th Percentile): The middle number between the median and the highest number of the data
  • Whisker: A line extending from each end of the box. Hence, the “Box Plot” is also called the “Box and Whisker Plot”. When there are no outliers, each whisker (lower & upper) covers about 25% of the data
  • Outliers: Any data point that falls outside the whiskers is plotted as an outlier with a dot
  • Inter Quartile Range (IQR): The difference between the 75th and 25th percentiles; simply, IQR = Q3 – Q1
  • Minimum: (Q1 - 1.5*IQR), the lower limit for the whisker; points below it are outliers
  • Maximum: (Q3 + 1.5*IQR), the upper limit for the whisker; points above it are outliers

 

Box Plot on Normal Distribution:

Let us try and understand the Box Plot on a normal distribution and the probability density function.

[Image: Box Plot on Normal Distribution]

What is a normal distribution?

The normal distribution is commonly summarized by the 68-95-99.7 rule.

Points to be considered here are:

  • 68% of the data is within 1 standard deviation (σ) of the mean (μ)
  • 95% of the data is within 2 standard deviations (σ) of the mean (μ)
  • 99.7% of the data is within 3 standard deviations (σ) of the mean (μ)
  • About 0.7% of the data falls beyond the whisker limits (Q1 – 1.5*IQR and Q3 + 1.5*IQR) and shows up as outliers
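
These percentages can be checked numerically. The following is a small sketch using `NormalDist` from Python's standard library on a standard normal distribution (μ = 0, σ = 1); the helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} sigma: {share_within(k):.4f}")

# Quartiles and IQR of the standard normal, in units of sigma
q1 = std_normal.inv_cdf(0.25)
q3 = std_normal.inv_cdf(0.75)
iqr = q3 - q1

# Share of the distribution lying beyond the 1.5 * IQR whisker limits
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outside = 1.0 - (std_normal.cdf(upper_limit) - std_normal.cdf(lower_limit))

print(f"IQR = {iqr:.3f} sigma, share outside the whisker limits = {outside:.4f}")
```

The IQR comes out at roughly 1.35σ and the share outside the whisker limits at roughly 0.7%, so for normally distributed data only a small fraction of points is expected to be drawn as outliers.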

Note: A box plot can also be constructed on non-normal data.

Let us now understand more about the box plot.

The basic format of the box plot is to use a box to convey the middle 50% of the data. This region is called the Interquartile Range (IQR).

Several variations on the traditional “Box Plot” exist. The most common among them are Variable Width Box Plot and Notched Box Plot.

  • Variable Width Box Plot: It illustrates the size of each group of data by making the width proportional to the size of the group
  • Notched Box Plot: It narrows the box into a “notch” around the median. The width of the notch is proportional to the IQR of the sample and inversely proportional to the square root of the size of the sample
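
The notch rule can be sketched as follows. The constant 1.57 is an assumption taken from a commonly cited convention (median ± 1.57 × IQR / √n), not something stated in this article, and the function name `notch_interval` is hypothetical. The quartiles come from `statistics.quantiles`, whose default method may differ slightly from other quartile conventions.

```python
import math
import statistics


def notch_interval(values, constant=1.57):
    """Return (low, high) notch limits around the median.

    The half-width grows with the IQR and shrinks with the square
    root of the sample size. `constant` is an assumed convention.
    """
    q1, med, q3 = statistics.quantiles(values, n=4)
    half_width = constant * (q3 - q1) / math.sqrt(len(values))
    return med - half_width, med + half_width


# Ten-point sample from the worked example in this article
sample = [11, 16, 12, 17, 14, 12, 16, 17, 13, 20]
low, high = notch_interval(sample)
print(f"notch from {low:.2f} to {high:.2f}")
```

With only ten points the notch is wide; with a larger sample and the same IQR it would narrow in proportion to 1/√n.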

Math Behind Box Plot:

Consider a sample of 10 data points

11, 16, 12, 17, 14, 12, 16, 17, 13, 20

  • Order the data from smallest to largest
    11, 12, 12, 13, 14, 16, 16, 17, 17, 20
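  • Find the Median

    The median is the middlemost value for an odd number of data points, or the average of the two middle values for an even number of data points. Here there are 10 data points, so the median is the average of the 5th and 6th values:
    Median = (14 + 16) / 2 = 15

  • Find the Quartiles

    The first quartile is the median of the data points to the left of the median, i.e. 11, 12, 12, 13, 14
    Q1 = 12

    The third quartile is the median of the data points to the right of the median, i.e. 16, 16, 17, 17, 20
    Q3 = 17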

  • Complete the five-number summary by finding the minimum and maximum value in the dataset
    • The minimum is the smallest data point, which is 11
    • The maximum is the largest data point, which is 20
    • The five-number summary is 11, 12, 15, 17, 20
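
The steps above can be sketched in Python. This is a minimal illustration of the median-of-halves method used in this example; the function name `five_number_summary` is hypothetical, and library functions that compute quantiles by interpolation can return slightly different quartiles for the same data.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    lower_half = ordered[: n // 2]        # points to the left of the median
    upper_half = ordered[(n + 1) // 2:]   # points to the right (middle value skipped when n is odd)
    return (
        ordered[0],
        median(lower_half),
        median(ordered),
        median(upper_half),
        ordered[-1],
    )


sample = [11, 16, 12, 17, 14, 12, 16, 17, 13, 20]
summary = five_number_summary(sample)
print(summary)  # (11, 12, 15.0, 17, 20)
```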

CONSTRUCTION OF A BOX PLOT USING THE ABOVE DATA

  • Mark an axis that fits the above five-number summary

    [Image: box plot axis]
  • Draw a box from Q1 to Q3 with a vertical line through the median

    [Image: box plot demarcating median and quartiles]
  • Draw a whisker from Q1 to the min and from Q3 to the max
    Min = 11 and Max = 20
    [Image: box plot demarcating whiskers]
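
The same construction can be sketched with matplotlib, assuming it is installed. `Axes.bxp` draws a box plot from statistics that have already been computed, so the hand-calculated five-number summary can be passed in directly. Note that this sketch draws the box vertically (matplotlib's default), whereas the steps above describe a horizontal box.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import matplotlib.pyplot as plt

# Five-number summary computed by hand in the section above
stats = [{
    "label": "sample",
    "whislo": 11,   # minimum: end of the lower whisker
    "q1": 12,       # first quartile: bottom of the box
    "med": 15,      # median: line inside the box
    "q3": 17,       # third quartile: top of the box
    "whishi": 20,   # maximum: end of the upper whisker
    "fliers": [],   # no outliers in this sample
}]

fig, ax = plt.subplots(figsize=(3, 4))
artists = ax.bxp(stats, showfliers=False)
ax.set_ylabel("value")
```

Calling `fig.savefig("boxplot.png")` (the file name is only an example) or `plt.show()` with an interactive backend then renders the figure.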

 

Interpreting the Quartiles

 

The five-number summary divides the data into sections, each containing approximately 25% of the data in that set.
[Image: box plot]

Since Q1 = 12, about 25% of the data is lower than 12 and about 75% is above 12.

 

Outliers:
If the data happens to be normally distributed, then IQR ≈ 1.35σ, where σ is the standard deviation.

A data point is treated as an outlier if it lies more than 1.5 * IQR above the third quartile or more than 1.5 * IQR below the first quartile.

Note that outliers are not necessarily “bad”; they may be the most important and most informative part of the dataset. They should not be removed without proper verification. Outliers require special treatment: they may be the key to the study, or they may be the result of human error.
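
The 1.5 * IQR rule can be sketched in plain Python. The function names `outlier_limits` and `find_outliers` are hypothetical, the quartiles use the same median-of-halves method as the worked example, and the extra value 35 is added purely for illustration (it is not part of the article's data).

```python
from statistics import median


def outlier_limits(values, k=1.5):
    """Return (lower, upper) limits: Q1 - k * IQR and Q3 + k * IQR."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = median(ordered[: n // 2])
    q3 = median(ordered[(n + 1) // 2:])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, k=1.5):
    """Return the values that fall outside the limits."""
    lower, upper = outlier_limits(values, k)
    return [v for v in values if v < lower or v > upper]


sample = [11, 16, 12, 17, 14, 12, 16, 17, 13, 20]
print(outlier_limits(sample))        # (4.5, 24.5)
print(find_outliers(sample))         # []
print(find_outliers(sample + [35]))  # [35]  (35 is a made-up value)
```

The function only flags the points. Whether a flagged point is an error or a genuine, informative observation still has to be verified by looking at the data.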

With the help of the Box Plot, I derived insights for 10 different universities. Here, however, I will explain the approach using one feature (Salaries) from the university data. (Please note that for this article I have masked the data points, as they contain some sensitive information.)

I have used Data Science tools like R and Python to come up with these insights.

R programming:

  • Load the required packages
  • “readxl”: The package used to read an Excel file
  • “read_excel()”: The function that reads an Excel file
  • “file.choose()”: Opens a file-selection dialog (GUI) and returns the chosen path, which is passed as the argument to load the dataset
  • “attach()”: Adds the data frame to R's search path, so that columns can be called directly by name without referring to the table name
  • “names()”: Function to show the column names

 

I created a Box Plot to identify the outliers in the data. Removing them would change the average salary.

[Image: box plot code in R language]

The output of the box plot looks like the image below, which shows that there are some outliers influencing the mean calculation.

[Image: box plot content]

Calculation of IQR: I need Q1 and Q3, for which the built-in function quantile() is used to calculate the 0.25 and 0.75 quantiles respectively.

[Image: using quantile function]

Outlier limits = (Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR); values beyond these limits are outliers.

Calculation of Outliers:

[Image: code for outliers]

Now I have the outlier data, which I append to the original file.

[Image: appending outliers data]

 

I will share this final analysis report, saved in the file Outliers.csv, with my friend so that he can make a wise decision in choosing the right university.

I can achieve the same using Python programming as well.

Python Programming

  • Import the required libraries and import the dataset.

    [Image: importing Python libraries]

    [Image: importing Python datasets]
  • In Python, I used the seaborn library for its boxplot() function

    [Image: Python code for box plot]

    I used swarmplot() to get a better representation of the distribution of the data. However, if the dataset is large, this representation is not ideal.

    [Image: swarmplot function]

    The output shows the distribution of data points along with the box plot.

    [Image: box plot output]

    We can see that there are some outliers; however, we need to know what those outliers are.

    In order to calculate outliers mathematically, we need the IQR (Inter Quartile Range), which is IQR = Q3 (Quartile 3) – Q1 (Quartile 1), i.e. the difference between the 75th and 25th percentiles.


  • In Python, we have a quantile() function to calculate percentiles; the IQR can be calculated from Q1 and Q3

    [Image: quantile function]

 

With the IQR, I calculate the outlier limits using the formulas (Q1 – 1.5 * IQR) and (Q3 + 1.5 * IQR). The Python code returns a Boolean for each data point: True if it is an outlier and False otherwise.

[Image: code for outliers]
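
Since the original code survives only as image placeholders, here is a minimal sketch of the same idea using pandas. The salary figures and the column names `Salary` and `Outlier` are hypothetical (the article's data is masked), and `Series.quantile` interpolates linearly by default, so its quartiles can differ slightly from a hand calculation.

```python
import pandas as pd

# Hypothetical salaries in USD; NOT the article's masked university data
df = pd.DataFrame({
    "Salary": [95000, 102000, 98000, 110000, 105000,
               99000, 101000, 250000, 97000, 104000]
})

# Quartiles and IQR
q1 = df["Salary"].quantile(0.25)
q3 = df["Salary"].quantile(0.75)
iqr = q3 - q1

# Outlier limits and a Boolean flag per row (True = outlier)
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
df["Outlier"] = (df["Salary"] < lower) | (df["Salary"] > upper)

# Compare the averages with and without the flagged rows
mean_all = df["Salary"].mean()
mean_without_outliers = df.loc[~df["Outlier"], "Salary"].mean()
median_all = df["Salary"].median()

print(df)
print(mean_all, mean_without_outliers, median_all)
```

In this made-up data a single extreme salary pulls the mean well above the median, which is the kind of effect that can make a quoted "average salary" misleading.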

 


[Image: salaries of university students after box plot analysis]

Conclusion: Data Science is all about communicating results to people who may not be aware of these hidden insights. From the analysis, I was able to see things that were not clear otherwise. Hopefully, the results from the statistical calculations will help my friend to choose wisely.

I would advise my friend not to fall prey to the extreme salary figures quoted by a few universities.

                         
