Bharani Kumar Depuru is a well-known IT personality from Hyderabad. He is the Founder and Director of AiSPRY and 360DigiTMG. An IIT and ISB alumnus with more than 18 years of experience, he has held prominent positions at firms such as HSBC, ITC Infotech, Infosys, and Deloitte. He is an IT consultant specializing in Industrial Revolution 4.0 implementation, Data Analytics practice setup, Artificial Intelligence, Big Data Analytics, Industrial IoT, Business Intelligence and Business Management. Bharani Kumar is also the chief trainer at 360DigiTMG, with more than ten years of experience, and has been making the IT transition journey easy for his students. 360DigiTMG is at the forefront of delivering quality education, thereby bridging the gap between academia and industry.
Yesterday, I ran into a former classmate at a nearby club. We had a lot to catch up on, as we hadn't seen each other in a long time. Our conversation eventually turned to the future.
'Friend': "I have selected a few. I am finding it difficult to choose. Can you help me decide?"
I did my part of the research and collected a few more details about those Universities like:
One of the university webpages included a message that caught my eye: "Students at the University earn an average of $132,000 annually."
I had gathered and organised the data necessary to begin my research, and I also had some intriguing thoughts to share with my buddy.
First Moment Business Decision or Measure of Central Tendency
In this article, I will explain one of the most helpful univariate graphical representations: the Box Plot.
A box plot is a graphical representation of how the values in the data are spread out.
What is the information provided by a Box Plot?
Box Plot will provide the following information
Let us try and understand the Box Plot on a normal distribution and the probability density function.
Box Plot on Normal Distribution
What is a normal distribution?
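The normal distribution is commonly summarised by the 68 - 95 - 99.7 rule: about 68%, 95% and 99.7% of the values lie within one, two and three standard deviations of the mean, respectively.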
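The rule can be checked numerically. The following is a minimal sketch in Python that uses only the standard library's statistics.NormalDist; the helper name share_within is chosen here for illustration and is not part of the original analysis.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
z = NormalDist(mu=0.0, sigma=1.0)


def share_within(k):
    """Fraction of a normal population lying within k standard deviations of the mean."""
    return z.cdf(k) - z.cdf(-k)


for k in (1, 2, 3):
    # Prints the share for 1, 2 and 3 standard deviations as a percentage.
    print(f"within {k} sigma: {share_within(k):.1%}")
```

The three printed shares come out close to 68%, 95% and 99.7%, which is where the rule gets its name.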
Points to be considered here are:
Note: a box plot can also be constructed on non-normal data.
Let us now understand more about Boxplot
The basic format of the box plot is to use a box to convey the middle 50% of the data. This region is called the interquartile range (IQR).
Several variations on the traditional “Box Plot” exist. The most common among them are Variable Width Box Plot and Notched Box Plot.
Consider a sample of 10 data points
11, 16, 12, 17, 14, 12, 16, 17, 13, 20
The median is the middle value for an odd number of data points, or the average of the two middle values for an even number. Sorted, the sample is 11, 12, 12, 13, 14, 16, 16, 17, 17, 20, so the median is (14 + 16) / 2 = 15.
The first quartile is the median of the data points to the left of the median, i.e. 11, 12, 12, 13, 14, so Q1 = 12.
The third quartile is the median of the data points to the right of the median, i.e. 16, 16, 17, 17, 20, so Q3 = 17.
The five-number summary (minimum, Q1, median, Q3, maximum) divides the data into sections that each contain about 25% of the data in the set. Since Q1 = 12, about 25% of the data lies below 12 and about 75% lies above it.
[Image: box plot]
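The quartile calculation on the 10-point sample can be reproduced in a few lines. This is a sketch in Python under the assumption that quartiles are taken as the medians of the lower and upper halves of the sorted data, as in the example; the function name five_number_summary is illustrative.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), with Q1 and Q3 as medians of the two halves."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]           # points to the left of the median
    upper = data[half + n % 2:]   # points to the right (middle point skipped when n is odd)
    return data[0], median(lower), median(data), median(upper), data[-1]


sample = [11, 16, 12, 17, 14, 12, 16, 17, 13, 20]
print(five_number_summary(sample))  # (11, 12, 15.0, 17, 20)
```

The sketch assumes at least two data points. Other quartile conventions interpolate between neighbouring values and can give slightly different Q1 and Q3 for the same data.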
Outliers: If the data happens to be normally distributed, then IQR ≈ 1.35σ, where σ is the standard deviation.
Outliers are data points that lie more than 1.5 * IQR above the third quartile or below the first quartile.
Note that outliers are not necessarily “bad”; they may be the most important and most informative part of the dataset, and they should not be removed without proper verification. Outliers require special treatment: they may be the key observations under study, or they may be the result of human error.
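To make the 1.5 * IQR rule concrete, here is a small Python sketch that applies it to the 10-point sample, using Q1 = 12 and Q3 = 17 from the worked example, and also checks the IQR of a standard normal distribution. The function name outlier_fences is an assumption made for this illustration.

```python
from statistics import NormalDist


def outlier_fences(q1, q3, k=1.5):
    """Return the lower and upper cut-offs: Q1 - k * IQR and Q3 + k * IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


sample = [11, 16, 12, 17, 14, 12, 16, 17, 13, 20]
lower, upper = outlier_fences(12, 17)  # Q1 and Q3 taken from the worked example
outliers = [x for x in sample if x < lower or x > upper]
print(lower, upper, outliers)  # 4.5 24.5 []

# For a standard normal distribution (sigma = 1) the IQR is the distance
# between the 25th and 75th percentiles.
z = NormalDist()
normal_iqr = z.inv_cdf(0.75) - z.inv_cdf(0.25)
print(round(normal_iqr, 3))
```

None of the ten sample values falls outside the range 4.5 to 24.5, so this particular sample has no outliers under the rule.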
With the help of the Box Plot, I tried to derive insights for 10 different universities. Here, however, I will explain the approach using one feature (Salaries) from the university data. (Please note that for this article I have masked the data points, as they contain some sensitive information.)
I have used Data Science tools like R and Python to come up with these insights.
I have created a Box Plot to identify outliers in the data. Because outliers pull the mean towards them, removing them changes the average salary.
BOX PLOT CODE IN R LANGUAGE
The output of the box plot looks like the image below, which shows that there are some outliers influencing the mean calculation.
[Image: box plot content]
Calculation of IQR: I need Q1 and Q3, for which the built-in function quantile() is used with probabilities 0.25 and 0.75 respectively.
[Image: using the quantile function]
Outliers are values below (Q1 - 1.5 * IQR) or above (Q3 + 1.5 * IQR).
Calculation of outliers:
[Image: code for outliers]
Now I have the outlier data, which I am appending to the original file.
[Image: appending outliers data]
I will share this final analysis report, saved in the file Outliers.csv, with my friend so that he can make an informed choice of university.
I can achieve the same using Python Programming as well
Python Programming
I used swarmplot() to get a better representation of the distribution of the data. However, this representation is not ideal when the data set is large.
[Image: swarmplot function]
The output shows the distribution of data points along with the box plot.
[Image: box plot output]
We can see that there are some outliers; however, we need to know which data points they are.
To calculate outliers mathematically, we need the IQR (interquartile range), which is IQR = Q3 (Quartile 3) - Q1 (Quartile 1), i.e. the difference between the 75th and 25th percentiles.
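As an illustration of the percentile-based calculation, the sketch below computes Q1, Q3 and the IQR for the 10-point sample used earlier, with statistics.quantiles from the Python standard library. It is not the code used for the original analysis. The "inclusive" method interpolates linearly between neighbouring data points, which to my understanding is the same convention R's quantile() uses by default.

```python
from statistics import quantiles

sample = [11, 16, 12, 17, 14, 12, 16, 17, 13, 20]

# n=4 splits the data into quartiles and returns the three cut points.
q1, q2, q3 = quantiles(sample, n=4, method="inclusive")

iqr = q3 - q1  # the 75th percentile minus the 25th percentile
print(q1, q3, iqr)  # 12.25 16.75 4.5
```

Because of the interpolation, these quartiles (12.25 and 16.75) differ slightly from the values obtained by taking the medians of the two halves (12 and 17). Both conventions are in common use, so it is worth stating which one a given tool applies.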
With the IQR, I flag as outliers the values below (Q1 - 1.5 * IQR) or above (Q3 + 1.5 * IQR). The Python code returns a Boolean for each value: True if it is an outlier, False otherwise.
[Image: code for outliers]
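A minimal sketch of such a Boolean outlier check is shown below. Since the real salary data is masked, the salaries list here is entirely hypothetical and chosen only to make the effect visible; the variable names are likewise illustrative.

```python
from statistics import mean, quantiles

# Hypothetical salaries, NOT the masked data from the analysis.
salaries = [52000, 55000, 58000, 60000, 61000,
            63000, 65000, 70000, 250000, 400000]

q1, _, q3 = quantiles(salaries, n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# One Boolean per salary: True means the value lies outside the cut-offs.
is_outlier = [s < lower or s > upper for s in salaries]
print(is_outlier)

# Compare the mean with and without the flagged values.
kept = [s for s, flagged in zip(salaries, is_outlier) if not flagged]
print(mean(salaries), mean(kept))  # 113400 60500
```

In this made-up example, two very high salaries nearly double the mean, which is the kind of effect that can make a quoted "average salary" misleading.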
SALARIES OF UNIVERSITY STUDENTS AFTER BOX PLOT ANALYSIS
Conclusion: Data science is largely about sharing findings with audiences who might not be familiar with them. The analysis helped me understand things I would not have seen otherwise. The outcomes of the statistical computations should guide my friend's decision-making.
I would warn my buddy not to be duped by the inflated salaries that certain universities are quoting.