Percentiles, Quantiles and Plots

Understand percentiles and quantiles, then the plots that are built on these concepts.

Rrohan.Arrora
4 min read · Apr 14, 2020
Image by hudsoncrafted from Pixabay

— What are percentiles?

Consider an array consisting of 100 sorted values as shown below.

array Aᵢ= [A₁, A₂, A₃, A₄, A₅, A₆, A₇, A₈, A₉, A₁₀, ………, A₉₉, A₁₀₀]

The median of the above array is the average of A₅₀ and A₅₁, because the array has an even number of values. The 50ᵗʰ percentile is the median; for 100 sorted values it is approximately A₅₀ (the exact value depends on the convention used). Likewise, the 90ᵗʰ percentile of the sorted array is approximately A₉₀, and the 20ᵗʰ percentile is approximately A₂₀.

Therefore, once the 100 values are sorted, the value at index k is (approximately) the kᵗʰ percentile. But what does a percentile tell us about the data?

Let us understand the idea behind percentiles with an example. Suppose you work at an e-commerce company like Flipkart, Amazon or Myntra. You are a delivery manager and you want to know how many days the delivery of products takes.

Let the 95ᵗʰ percentile value be 4 days and the 99ᵗʰ percentile value be 5.4 days.

⇒ 95 percent of your deliveries take 4 days or less, and 99 percent of your deliveries take 5.4 days or less. If you want every customer to receive their package within a 6-day limit, the deliveries to speed up are the slowest 1%, which take longer than 5.4 days.

The same goes for your score in competitive exams. If you are at the 99ᵗʰ percentile, then 99% of the people who appeared in the exam scored less than you.

This is how percentiles help in evaluating the data.
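The idea can be sketched in a few lines of Python. This is only a minimal illustration using the nearest-rank convention, which is one of several ways to define a percentile. The function name nearest_rank_percentile and the delivery times below are made up for this example; they are not real data.

```python
import math

def nearest_rank_percentile(values, p):
    """Return the p-th percentile (0 < p <= 100) by the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    # 1-based rank of the smallest value that has at least p% of the data at or below it
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]

# Stand-in for the sorted array A1..A100, where A_i is simply i.
a = list(range(1, 101))
p20 = nearest_rank_percentile(a, 20)  # value at position 20
p90 = nearest_rank_percentile(a, 90)  # value at position 90

# Made-up delivery times in days for 100 orders (illustrative only).
delivery_days = [2] * 60 + [3] * 30 + [4] * 5 + [5] * 4 + [9]
p95 = nearest_rank_percentile(delivery_days, 95)
p99 = nearest_rank_percentile(delivery_days, 99)
print(f"95% of orders arrived within {p95} days, 99% within {p99} days")
```

In this made-up data a single order took 9 days. The mean barely notices it, but the gap between the 99ᵗʰ percentile and the maximum shows it clearly.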

— What are quantiles?

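Quantiles are cut points that split the sorted data into equal-sized groups. Percentiles are the quantiles that split it into 100 groups.

The 25ᵗʰ, 50ᵗʰ and 75ᵗʰ percentile values are the most commonly used quantiles. They are known as quartiles because, as the name suggests, they split the data into four quarters. The 0ᵗʰ and 100ᵗʰ percentile values are simply the minimum and maximum of the data.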
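As a rough sketch, Python's built-in statistics module can compute these cut points. The data here is a stand-in for the sorted array A₁ … A₁₀₀ (each value is just its index), and the default interpolation method of statistics.quantiles is only one of several conventions, so other tools may return slightly different numbers.

```python
import statistics

# Stand-in for the sorted array A1..A100, where A_i is simply i.
data = list(range(1, 101))

# n=4 returns the 3 cut points that split the data into four quarters.
quartiles = statistics.quantiles(data, n=4)

# n=100 returns the 99 cut points that split the data into 100 groups.
percentiles = statistics.quantiles(data, n=100)

q1, q2, q3 = quartiles
# The 25th, 50th and 75th percentiles sit at list indices 24, 49 and 74.
p25, p50, p75 = percentiles[24], percentiles[49], percentiles[74]

print(quartiles)       # the three quartiles
print(p25, p50, p75)   # the same three values, read off the percentile list
```

The middle quartile is the median, and each quartile coincides with the matching percentile.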

— Median Absolute Deviation

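Let the median of the input dataset be m. The dataset consists of 100 points A₁, A₂, A₃, A₄, A₅, A₆, A₇, A₈, A₉, A₁₀, ………, A₉₉, A₁₀₀. Then the median absolute deviation (MAD) is defined as

MAD = median(|Aᵢ - m|), for i = 1, 2, ………, 100

That is, take the absolute distance of every point from the median, then take the median of those distances.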

We could take the mean of the absolute deviations instead, but the mean is very sensitive to outliers, so the median-based MAD is more robust. The concept plays a role similar to the traditional standard deviation: both measure how spread out the data is.
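A minimal sketch of the calculation, using only the standard library. The function name median_absolute_deviation and the sample values are invented for this illustration; the single large value is there to show how differently the median-based and mean-based measures react to an outlier.

```python
import statistics

def median_absolute_deviation(values):
    """Median of the absolute distances of each value from the median."""
    m = statistics.median(values)
    return statistics.median([abs(x - m) for x in values])

# Made-up sample with one obvious outlier (100).
sample = [2, 3, 3, 4, 5, 6, 100]

m = statistics.median(sample)
mad = median_absolute_deviation(sample)

# For comparison: the mean of the same absolute deviations, and the standard deviation.
mean_abs_dev = statistics.mean([abs(x - m) for x in sample])
std_dev = statistics.pstdev(sample)

print(mad, mean_abs_dev, std_dev)
```

The MAD stays close to the typical spread of the six ordinary values, while the mean-based figure and the standard deviation are both pulled up sharply by the single outlier.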

— Interquartile Range (IQR)

Let the 75ᵗʰ percentile value be P₇₅ and the 25ᵗʰ percentile value be P₂₅.

IQR = P₇₅ - P₂₅

The middle 50% of the values lie in this range, between P₂₅ and P₇₅.
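A quick check of this claim, again as a sketch: the data is a stand-in for the sorted array A₁ … A₁₀₀, and the quartiles come from the default method of statistics.quantiles, which is an assumption of this example.

```python
import statistics

# Stand-in for the sorted array A1..A100, where A_i is simply i.
data = list(range(1, 101))

# The first and last of the three quartiles are P25 and P75.
p25, _, p75 = statistics.quantiles(data, n=4)
iqr = p75 - p25

# Count how many values fall between P25 and P75.
inside = [x for x in data if p25 <= x <= p75]
fraction_inside = len(inside) / len(data)

print(f"P25={p25}, P75={p75}, IQR={iqr}, fraction inside={fraction_inside}")
```

With repeated values or very small datasets the fraction can drift a little from exactly one half, but it stays close to it.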

— Box plot and Whiskers

Though histograms are very valuable for showing the density of the points, they cannot directly give us the percentile values of the data. We can read percentile estimates from CDF plots, but there are other plots that show them more directly, along with several other things.

Box plots show the key percentiles very clearly. Let us understand these plots.

Box plots are named after their shape. They represent the 25ᵗʰ, 50ᵗʰ and 75ᵗʰ percentile values of the dataset in the form of a box.

Areas of the box plot as described by Seaborn.

Box plot analysis on the very famous iris dataset.

Box plot on the iris dataset.
  • The first box plot shows the percentile values of the sepal length feature for the three species of iris flowers. The 25ᵗʰ, 50ᵗʰ and 75ᵗʰ percentile values of sepal length for Setosa are approximately 4.8, 5 and 5.25.
  • When petal width is considered as the feature, the 25ᵗʰ, 50ᵗʰ and 75ᵗʰ percentile values for Virginica are approximately 1.7, 2.0 and 2.3. This tells us that 25% of Virginica flowers have a petal width smaller than 1.7. Looking at the petal length plot, we can also say that if petal length > 2 and < 5, the flower is Versicolor; if petal length < 2, it is Setosa; otherwise it is Virginica. At the same time, about 25% of the Virginica flowers have petal length < 5, which means that if I use the above if-else, about 25% of the Virginica flowers will be labelled wrong.
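The if-else rule described above can be written down directly. This is only a sketch: the function name classify_by_petal_length is invented here, the thresholds 2 and 5 are the ones read off the plot in the text, and the handling of a petal length of exactly 2 or exactly 5 is an assumption, since the text does not cover those boundaries.

```python
def classify_by_petal_length(petal_length):
    """Label an iris flower using only the two petal-length thresholds from the text."""
    if petal_length < 2:
        return "setosa"
    if petal_length < 5:
        return "versicolor"
    return "virginica"

# Made-up measurements (not taken from the dataset), paired with the true species.
examples = [
    (1.4, "setosa"),
    (4.5, "versicolor"),
    (5.8, "virginica"),
    (4.9, "virginica"),  # a Virginica flower below the threshold of 5
]

predictions = [classify_by_petal_length(length) for length, _ in examples]
mistakes = [
    (length, truth, guess)
    for (length, truth), guess in zip(examples, predictions)
    if truth != guess
]
print(mistakes)
```

The last example is the kind of flower the text warns about: a Virginica with petal length below 5 is labelled Versicolor by this rule.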

In the same way, we can read the percentile values of the other features and species from the plots.

Whiskers

Box plot with whisker

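There is no single standard way to define whiskers. In the simplest form, the whiskers extend to the minimum and maximum values of the feature. A common alternative, which Seaborn uses by default, extends each whisker to the farthest point that lies within 1.5 × IQR of the box and draws any points beyond that as individual outliers.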
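The 1.5 × IQR convention can be sketched as follows. The function name whisker_summary and the sample values are made up for this illustration, and the quartile method chosen here (method="inclusive") is an assumption; plotting libraries may compute quartiles slightly differently, which can shift the fences a little.

```python
import statistics

def whisker_summary(values, k=1.5):
    """Whisker ends and outliers under the 'k times IQR' convention."""
    ordered = sorted(values)
    q1, _, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Whiskers end at the most extreme data points that are still inside the fences.
    inside = [x for x in ordered if lower_fence <= x <= upper_fence]
    outliers = [x for x in ordered if x < lower_fence or x > upper_fence]
    return {
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }

# Made-up sample with one extreme value.
sample = [2, 3, 3, 4, 5, 6, 100]
summary = whisker_summary(sample)
print(summary)
```

Under the min/max definition the upper whisker of this sample would stretch all the way to 100. Under the 1.5 × IQR convention it stops at 6 and the value 100 is drawn as a separate outlier point.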

— Violin plots

Violin plots on the Iris dataset.

A violin plot plays a similar role to a box-and-whisker plot. It shows the probability density of the data on the sides and a box plot inside. The curve-like boundaries you may notice are the estimated probability density function. In effect, a violin plot merges the histogram and the box plot.

Violin plots are typically more informative than box plots, because they show the underlying distribution of the data in addition to the statistical summary.

— Wrapping Up

I hope the concepts illustrated above are now clear. All of them are basic requirements for univariate analysis on any dataset.
