Outlier Detection Using PySpark

Rajshekar
Dec 21, 2021

First of all, we need to understand the basic definitions of the concepts below:

What is an Outlier?

An outlier is a data point that differs significantly from the other data points. An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set. An outlier can cause serious problems in statistical analyses.

Interquartile Range

The interquartile range is what we can use to determine if an extreme value is indeed an outlier. The interquartile range is based upon part of the five-number summary of a data set, namely the first quartile and the third quartile. The calculation of the interquartile range involves a single arithmetic operation. All that we have to do to find the interquartile range is to subtract the first quartile from the third quartile. The resulting difference tells us how spread out the middle half of our data is.

Determining Outliers

Multiplying the interquartile range (IQR) by 1.5 will give us a way to determine whether a certain value is an outlier. If we subtract 1.5 x IQR from the first quartile, any data values that are less than this number are considered outliers. Similarly, if we add 1.5 x IQR to the third quartile, any data values that are greater than this number are considered outliers.
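To make the rule concrete, here is a quick sketch in plain Python (the sample numbers are made up for illustration; Python's `statistics` module stands in for Spark at this point):

```python
import statistics

data = [10, 11, 12, 12, 12, 13, 13, 14, 15, 102]

# First and third quartiles; the "inclusive" method interpolates
# between sorted data points.
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]
print(outliers)  # → [102]
```

Here Q1 = 12 and Q3 = 13.75, so the IQR is 1.75 and anything outside [9.375, 16.375] is flagged; only 102 qualifies.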

Now that we know the basic definitions, we can move on to the dataset and the PySpark code to detect and remove the outliers.

Dataset Source : https://archive.ics.uci.edu/ml/datasets/wholesale+customers

The above data set contains 2 categorical features and 6 numerical features. It refers to clients of a wholesale distributor and includes the annual spending in monetary units (m.u.) on diverse product categories.

Initializing Spark and importing the necessary libraries:

Reading the dataset from Google Drive.

Print the schema

Create a custom function that accepts a Spark DataFrame as an argument.

The function performs the following actions:

  1. Calculate the first and third quartiles (Q1 and Q3) of each column.
  2. Compute IQR = Q3 - Q1 for each column.
  3. Create a mask that checks whether a value falls below "Q1 - 1.5*IQR" or above "Q3 + 1.5*IQR".
  4. Create a new column named "is_outlier_{columnName}", which flags whether each row is an outlier in that column.
  5. Create another new column named "total_outliers", which sums the "is_outlier_{columnName}" flags and gives the total outlier count for each row.
  6. Drop the extra "is_outlier_{columnName}" columns, keeping only the "total_outliers" column.

Finally, the function returns the Spark DataFrame with the new "total_outliers" column.

Filter the above DataFrame to select only the records with "total_outliers" ≤ 1, and display them.

Plot box plots for the original DataFrame and the new DataFrame after removing the outliers.

Note that Spark DataFrames must be converted into pandas DataFrames before using seaborn or Matplotlib plots.

We have identified a few outliers in the Spark DataFrame, defined a custom function to count the total outliers in each row, and filtered the DataFrame to keep only those records with 0 or 1 outliers.

We can still see some outliers even after removing the initial ones, which shows that the data is skewed. However, this is a good starting point for building machine learning models, for example by converting the above DataFrame into Z-score values.

Hope you liked this post; let me know if you have any suggestions or comments on the subject.

Thank you all and happy coding

The entire code is available on my GitHub:

https://github.com/Rajshekar-2021/Outlier-Detection-PYSPARK/blob/main/Customer_Data_Outliers_pyspark.ipynb
