This function runs univariate outlier analysis using the boxplot or standard deviation (SD) method. It returns an outlier summary for the selected numeric features and, if any outliers are found, adds new features to the data.

ExpOutliers(
  data,
  varlist = NULL,
  method = "boxplot",
  treatment = NULL,
  capping = c(0.05, 0.95),
  outflag = FALSE
)

Arguments

data

data frame or matrix

varlist

list of numeric variables on which to perform the univariate outlier analysis

method

outlier detection method: boxplot or NxStDev (where N is 1, 2 or 3 standard deviations, i.e. 1xStDev, 2xStDev or 3xStDev)

treatment

treat outlier values by replacing them with the mean or median. Default NULL

capping

default LL = 0.05 and UL = 0.95. Caps the outlier values by replacing observations below the lower limit with the 5th percentile value and observations above the upper limit with the 95th percentile value

outflag

add extreme value flag variables to the output data
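
To illustrate how capping and treatment act on flagged values, the base R sketch below mimics what the example output on this page shows for mtcars: values flagged by the boxplot rule are replaced either by the percentile cap or by the mean of the non-outlying values. The helpers boxplot_bounds, cap_outliers and impute_outliers are hypothetical names used only for illustration; they are not part of the package, and its internal implementation may differ.

```r
# Hypothetical helpers: a sketch of the idea, not the package's internal code.

# Boxplot bounds: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR
boxplot_bounds <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - 1.5 * iqr, upper = q[2] + 1.5 * iqr)
}

# Capping: flagged low values get the lower percentile value,
# flagged high values get the upper percentile value.
cap_outliers <- function(x, capping = c(0.05, 0.95)) {
  b <- boxplot_bounds(x)
  caps <- quantile(x, capping, na.rm = TRUE, names = FALSE)
  low <- which(x < b[["lower"]])
  high <- which(x > b[["upper"]])
  x[low] <- caps[1]
  x[high] <- caps[2]
  x
}

# Treatment: flagged values are replaced by the mean (or median)
# of the values that were not flagged.
impute_outliers <- function(x, fun = mean) {
  b <- boxplot_bounds(x)
  out <- !is.na(x) & (x < b[["lower"]] | x > b[["upper"]])
  x[out] <- fun(x[!out], na.rm = TRUE)
  x
}

mpg <- mtcars$mpg
capped <- cap_outliers(mpg, capping = c(0.1, 0.9))
imputed <- impute_outliers(mpg, mean)

capped[20]             # 30.09, the 90th percentile replaces 33.9
round(imputed[20], 2)  # 19.65, the mean of the other 31 values
```

Only row 20 changes here; row 18 (32.4) lies above the 90th percentile but inside the boxplot bounds, so it is left as it is.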

Value

Outlier summary includes

  • Num of outliers: number of outliers in each variable

  • Lower bound: Q1 minus 1.5x IQR for the boxplot method; mean minus N x StdDev for the standard deviation method (N = 1, 2 or 3, as set in method)

  • Upper bound: Q3 plus 1.5x IQR for the boxplot method; mean plus N x StdDev for the standard deviation method (N = 1, 2 or 3, as set in method)

  • Lower cap: lower percentile capping value

  • Upper cap: upper percentile capping value
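
The bound and count rows of the summary can be checked by hand with base R. The sketch below assumes the default quantile definition of quantile() and uses a hypothetical helper, boxplot_summary, which is not part of the package; with those assumptions it gives the same rounded bounds and outlier counts as the boxplot example on this page.

```r
# Hypothetical helper: recomputes three summary rows for one numeric vector.
boxplot_summary <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  lower <- q[1] - 1.5 * iqr
  upper <- q[2] + 1.5 * iqr
  c(lower_bound = round(lower, 2),
    upper_bound = round(upper, 2),
    n_outliers  = sum(x < lower | x > upper, na.rm = TRUE))
}

# One column per variable, one row per summary statistic
res <- sapply(mtcars[c("mpg", "disp", "wt", "qsec")], boxplot_summary)
print(res)
```

For mpg, for instance, Q1 = 15.425 and Q3 = 22.8, so the IQR is 7.375 and the bounds are 15.425 - 11.0625 = 4.36 and 22.8 + 11.0625 = 33.86 after rounding.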

Details

This function returns both a summary of the outliers for each variable and the data.

Univariate outlier analysis method

  • boxplot: if a data value is below (Q1 minus 1.5x IQR), the boxplot lower whisker, or above (Q3 plus 1.5x IQR), the boxplot upper whisker, that point is flagged as an outlier

  • Standard Deviation: if a data distribution is approximately normal, then about 68 percent of the data values lie within one standard deviation of the mean, about 95 percent within two standard deviations, and about 99.7 percent within three standard deviations. Any data point that lies more than the chosen number of standard deviations from the mean (for example 3 for 3xStDev) is flagged as an outlier
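
The standard deviation rule described above can be sketched in a few lines of base R. The helper sd_outliers is a hypothetical name used for illustration, not a package function; n stands for the N in NxStDev.

```r
# Hypothetical helper: flag values more than n standard deviations from the mean.
sd_outliers <- function(x, n = 3) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  lower <- m - n * s
  upper <- m + n * s
  list(lower = lower, upper = upper,
       index = which(x < lower | x > upper))
}

res2 <- sd_outliers(mtcars$mpg, n = 2)
round(c(res2$lower, res2$upper), 2)  # 8.04 32.14
res2$index                           # 18 20

# Share of a normal distribution within 1, 2 and 3 standard deviations
round(pnorm(1:3) - pnorm(-(1:3)), 3) # 0.683 0.954 0.997
```

With n = 3 the same variable has no flagged values, which shows how strongly the choice of N affects the result on a small sample.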

Examples

ExpOutliers(mtcars, varlist = c("mpg","disp","wt", "qsec"), method = 'BoxPlot', capping = c(0.1, 0.9), outflag = TRUE)
#> $outlier_summary
#>              Category   mpg    disp       wt   qsec
#> 1     Lower cap : 0.1 14.34   80.61   1.9555 15.534
#> 2     Upper cap : 0.9 30.09     396   4.0475  19.99
#> 3         Lower bound  4.36 -186.94     1.04  13.88
#> 4         Upper bound 33.86  633.76     5.15  21.91
#> 5     Num of outliers     1       0        3      1
#> 6  Lower outlier case
#> 7  Upper outlier case    20         15,16,17      9
#> 8         Mean before 20.09  230.72     3.22  17.85
#> 9          Mean after 19.65  230.72        3  17.69
#> 10      Median before  19.2   196.3    3.325  17.71
#> 11       Median after  19.2   196.3     3.19   17.6
#>
#> $outlier_data
#>      mpg cyl  disp  hp drat    wt  qsec vs am gear carb out_cap_mpg out_cap_wt
#>  1: 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4       21.00     2.6200
#>  2: 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4       21.00     2.8750
#>  3: 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1       22.80     2.3200
#>  4: 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1       21.40     3.2150
#>  5: 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2       18.70     3.4400
#>  6: 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1       18.10     3.4600
#>  7: 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4       14.30     3.5700
#>  8: 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2       24.40     3.1900
#>  9: 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2       22.80     3.1500
#> 10: 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4       19.20     3.4400
#> 11: 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4       17.80     3.4400
#> 12: 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3       16.40     4.0700
#> 13: 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3       17.30     3.7300
#> 14: 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3       15.20     3.7800
#> 15: 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4       10.40     4.0475
#> 16: 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4       10.40     4.0475
#> 17: 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4       14.70     4.0475
#> 18: 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1       32.40     2.2000
#> 19: 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2       30.40     1.6150
#> 20: 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1       30.09     1.8350
#> 21: 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1       21.50     2.4650
#> 22: 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2       15.50     3.5200
#> 23: 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2       15.20     3.4350
#> 24: 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4       13.30     3.8400
#> 25: 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2       19.20     3.8450
#> 26: 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1       27.30     1.9350
#> 27: 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2       26.00     2.1400
#> 28: 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2       30.40     1.5130
#> 29: 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4       15.80     3.1700
#> 30: 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6       19.70     2.7700
#> 31: 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8       15.00     3.5700
#> 32: 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2       21.40     2.7800
#>      mpg cyl  disp  hp drat    wt  qsec vs am gear carb out_cap_mpg out_cap_wt
#>     out_cap_qsec out_flag_mpg out_flag_wt out_flag_qsec
#>  1:        16.46            0           0             0
#>  2:        17.02            0           0             0
#>  3:        18.61            0           0             0
#>  4:        19.44            0           0             0
#>  5:        17.02            0           0             0
#>  6:        20.22            0           0             0
#>  7:        15.84            0           0             0
#>  8:        20.00            0           0             0
#>  9:        19.99            0           0             1
#> 10:        18.30            0           0             0
#> 11:        18.90            0           0             0
#> 12:        17.40            0           0             0
#> 13:        17.60            0           0             0
#> 14:        18.00            0           0             0
#> 15:        17.98            0           1             0
#> 16:        17.82            0           1             0
#> 17:        17.42            0           1             0
#> 18:        19.47            0           0             0
#> 19:        18.52            0           0             0
#> 20:        19.90            1           0             0
#> 21:        20.01            0           0             0
#> 22:        16.87            0           0             0
#> 23:        17.30            0           0             0
#> 24:        15.41            0           0             0
#> 25:        17.05            0           0             0
#> 26:        18.90            0           0             0
#> 27:        16.70            0           0             0
#> 28:        16.90            0           0             0
#> 29:        14.50            0           0             0
#> 30:        15.50            0           0             0
#> 31:        14.60            0           0             0
#> 32:        18.60            0           0             0
#>     out_cap_qsec out_flag_mpg out_flag_wt out_flag_qsec
#>
#> $outlier_index
#> $outlier_index$upper_out_index
#> $outlier_index$upper_out_index$mpg
#> [1] 20
#>
#> $outlier_index$upper_out_index$disp
#> numeric(0)
#>
#> $outlier_index$upper_out_index$wt
#> [1] 15 16 17
#>
#> $outlier_index$upper_out_index$qsec
#> [1] 9
#>
#>
#> $outlier_index$lower_out_index
#> $outlier_index$lower_out_index$mpg
#> numeric(0)
#>
#> $outlier_index$lower_out_index$disp
#> numeric(0)
#>
#> $outlier_index$lower_out_index$wt
#> numeric(0)
#>
#> $outlier_index$lower_out_index$qsec
#> numeric(0)
#>
#>
#>
ExpOutliers(mtcars, varlist = c("mpg","disp","wt", "qsec"), method = '2xStDev', capping = c(0.1, 0.9), outflag = TRUE)
#> $outlier_summary
#>              Category   mpg   disp       wt   qsec
#> 1     Lower cap : 0.1 14.34  80.61   1.9555 15.534
#> 2     Upper cap : 0.9 30.09    396   4.0475  19.99
#> 3         Lower bound  8.04 -17.16     1.26  14.27
#> 4         Upper bound 32.14  478.6     5.17  21.42
#> 5     Num of outliers     2      0        3      1
#> 6  Lower outlier case
#> 7  Upper outlier case 18,20        15,16,17      9
#> 8         Mean before 20.09 230.72     3.22  17.85
#> 9          Mean after 19.22 230.72        3  17.69
#> 10      Median before  19.2  196.3    3.325  17.71
#> 11       Median after 18.95  196.3     3.19   17.6
#>
#> $outlier_data
#>      mpg cyl  disp  hp drat    wt  qsec vs am gear carb out_cap_mpg out_cap_wt
#>  1: 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4       21.00     2.6200
#>  2: 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4       21.00     2.8750
#>  3: 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1       22.80     2.3200
#>  4: 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1       21.40     3.2150
#>  5: 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2       18.70     3.4400
#>  6: 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1       18.10     3.4600
#>  7: 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4       14.30     3.5700
#>  8: 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2       24.40     3.1900
#>  9: 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2       22.80     3.1500
#> 10: 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4       19.20     3.4400
#> 11: 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4       17.80     3.4400
#> 12: 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3       16.40     4.0700
#> 13: 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3       17.30     3.7300
#> 14: 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3       15.20     3.7800
#> 15: 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4       10.40     4.0475
#> 16: 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4       10.40     4.0475
#> 17: 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4       14.70     4.0475
#> 18: 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1       30.09     2.2000
#> 19: 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2       30.40     1.6150
#> 20: 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1       30.09     1.8350
#> 21: 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1       21.50     2.4650
#> 22: 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2       15.50     3.5200
#> 23: 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2       15.20     3.4350
#> 24: 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4       13.30     3.8400
#> 25: 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2       19.20     3.8450
#> 26: 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1       27.30     1.9350
#> 27: 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2       26.00     2.1400
#> 28: 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2       30.40     1.5130
#> 29: 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4       15.80     3.1700
#> 30: 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6       19.70     2.7700
#> 31: 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8       15.00     3.5700
#> 32: 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2       21.40     2.7800
#>      mpg cyl  disp  hp drat    wt  qsec vs am gear carb out_cap_mpg out_cap_wt
#>     out_cap_qsec out_flag_mpg out_flag_wt out_flag_qsec
#>  1:        16.46            0           0             0
#>  2:        17.02            0           0             0
#>  3:        18.61            0           0             0
#>  4:        19.44            0           0             0
#>  5:        17.02            0           0             0
#>  6:        20.22            0           0             0
#>  7:        15.84            0           0             0
#>  8:        20.00            0           0             0
#>  9:        19.99            0           0             1
#> 10:        18.30            0           0             0
#> 11:        18.90            0           0             0
#> 12:        17.40            0           0             0
#> 13:        17.60            0           0             0
#> 14:        18.00            0           0             0
#> 15:        17.98            0           1             0
#> 16:        17.82            0           1             0
#> 17:        17.42            0           1             0
#> 18:        19.47            1           0             0
#> 19:        18.52            0           0             0
#> 20:        19.90            1           0             0
#> 21:        20.01            0           0             0
#> 22:        16.87            0           0             0
#> 23:        17.30            0           0             0
#> 24:        15.41            0           0             0
#> 25:        17.05            0           0             0
#> 26:        18.90            0           0             0
#> 27:        16.70            0           0             0
#> 28:        16.90            0           0             0
#> 29:        14.50            0           0             0
#> 30:        15.50            0           0             0
#> 31:        14.60            0           0             0
#> 32:        18.60            0           0             0
#>     out_cap_qsec out_flag_mpg out_flag_wt out_flag_qsec
#>
#> $outlier_index
#> $outlier_index$upper_out_index
#> $outlier_index$upper_out_index$mpg
#> [1] 18 20
#>
#> $outlier_index$upper_out_index$disp
#> numeric(0)
#>
#> $outlier_index$upper_out_index$wt
#> [1] 15 16 17
#>
#> $outlier_index$upper_out_index$qsec
#> [1] 9
#>
#>
#> $outlier_index$lower_out_index
#> $outlier_index$lower_out_index$mpg
#> numeric(0)
#>
#> $outlier_index$lower_out_index$disp
#> numeric(0)
#>
#> $outlier_index$lower_out_index$wt
#> numeric(0)
#>
#> $outlier_index$lower_out_index$qsec
#> numeric(0)
#>
#>
#>
# Mean imputation or 5th percentile or 95th percentile value capping
ExpOutliers(mtcars, varlist = c("mpg","disp","wt", "qsec"), method = 'BoxPlot', treatment = "mean", capping = c(0.05, 0.95), outflag = TRUE)
#> $outlier_summary
#>              Category    mpg    disp       wt    qsec
#> 1    Lower cap : 0.05 11.995   77.35    1.736 15.0455
#> 2    Upper cap : 0.95   31.3     449  5.29275 20.1045
#> 3         Lower bound   4.36 -186.94     1.04   13.88
#> 4         Upper bound  33.86  633.76     5.15   21.91
#> 5     Num of outliers      1       0        3       1
#> 6  Lower outlier case
#> 7  Upper outlier case     20         15,16,17       9
#> 8         Mean before  20.09  230.72     3.22   17.85
#> 9          Mean after  19.65  230.72        3   17.69
#> 10      Median before   19.2   196.3    3.325   17.71
#> 11       Median after   19.2   196.3     3.19    17.6
#>
#> $outlier_data
#>      mpg cyl  disp  hp drat    wt  qsec vs am gear carb out_cap_mpg out_cap_wt
#>  1: 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4        21.0    2.62000
#>  2: 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4        21.0    2.87500
#>  3: 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1        22.8    2.32000
#>  4: 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1        21.4    3.21500
#>  5: 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2        18.7    3.44000
#>  6: 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1        18.1    3.46000
#>  7: 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4        14.3    3.57000
#>  8: 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2        24.4    3.19000
#>  9: 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2        22.8    3.15000
#> 10: 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4        19.2    3.44000
#> 11: 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4        17.8    3.44000
#> 12: 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3        16.4    4.07000
#> 13: 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3        17.3    3.73000
#> 14: 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3        15.2    3.78000
#> 15: 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4        10.4    5.29275
#> 16: 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4        10.4    5.29275
#> 17: 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4        14.7    5.29275
#> 18: 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1        32.4    2.20000
#> 19: 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2        30.4    1.61500
#> 20: 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1        31.3    1.83500
#> 21: 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1        21.5    2.46500
#> 22: 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2        15.5    3.52000
#> 23: 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2        15.2    3.43500
#> 24: 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4        13.3    3.84000
#> 25: 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2        19.2    3.84500
#> 26: 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1        27.3    1.93500
#> 27: 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2        26.0    2.14000
#> 28: 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2        30.4    1.51300
#> 29: 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4        15.8    3.17000
#> 30: 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6        19.7    2.77000
#> 31: 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8        15.0    3.57000
#> 32: 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2        21.4    2.78000
#>      mpg cyl  disp  hp drat    wt  qsec vs am gear carb out_cap_mpg out_cap_wt
#>     out_cap_qsec out_flag_mpg out_flag_wt out_flag_qsec out_imp_mpg out_imp_wt
#>  1:      16.4600            0           0             0       21.00      2.620
#>  2:      17.0200            0           0             0       21.00      2.875
#>  3:      18.6100            0           0             0       22.80      2.320
#>  4:      19.4400            0           0             0       21.40      3.215
#>  5:      17.0200            0           0             0       18.70      3.440
#>  6:      20.2200            0           0             0       18.10      3.460
#>  7:      15.8400            0           0             0       14.30      3.570
#>  8:      20.0000            0           0             0       24.40      3.190
#>  9:      20.1045            0           0             1       22.80      3.150
#> 10:      18.3000            0           0             0       19.20      3.440
#> 11:      18.9000            0           0             0       17.80      3.440
#> 12:      17.4000            0           0             0       16.40      4.070
#> 13:      17.6000            0           0             0       17.30      3.730
#> 14:      18.0000            0           0             0       15.20      3.780
#> 15:      17.9800            0           1             0       10.40      3.000
#> 16:      17.8200            0           1             0       10.40      3.000
#> 17:      17.4200            0           1             0       14.70      3.000
#> 18:      19.4700            0           0             0       32.40      2.200
#> 19:      18.5200            0           0             0       30.40      1.615
#> 20:      19.9000            1           0             0       19.65      1.835
#> 21:      20.0100            0           0             0       21.50      2.465
#> 22:      16.8700            0           0             0       15.50      3.520
#> 23:      17.3000            0           0             0       15.20      3.435
#> 24:      15.4100            0           0             0       13.30      3.840
#> 25:      17.0500            0           0             0       19.20      3.845
#> 26:      18.9000            0           0             0       27.30      1.935
#> 27:      16.7000            0           0             0       26.00      2.140
#> 28:      16.9000            0           0             0       30.40      1.513
#> 29:      14.5000            0           0             0       15.80      3.170
#> 30:      15.5000            0           0             0       19.70      2.770
#> 31:      14.6000            0           0             0       15.00      3.570
#> 32:      18.6000            0           0             0       21.40      2.780
#>     out_cap_qsec out_flag_mpg out_flag_wt out_flag_qsec out_imp_mpg out_imp_wt
#>     out_imp_qsec
#>  1:        16.46
#>  2:        17.02
#>  3:        18.61
#>  4:        19.44
#>  5:        17.02
#>  6:        20.22
#>  7:        15.84
#>  8:        20.00
#>  9:        17.69
#> 10:        18.30
#> 11:        18.90
#> 12:        17.40
#> 13:        17.60
#> 14:        18.00
#> 15:        17.98
#> 16:        17.82
#> 17:        17.42
#> 18:        19.47
#> 19:        18.52
#> 20:        19.90
#> 21:        20.01
#> 22:        16.87
#> 23:        17.30
#> 24:        15.41
#> 25:        17.05
#> 26:        18.90
#> 27:        16.70
#> 28:        16.90
#> 29:        14.50
#> 30:        15.50
#> 31:        14.60
#> 32:        18.60
#>     out_imp_qsec
#>
#> $outlier_index
#> $outlier_index$upper_out_index
#> $outlier_index$upper_out_index$mpg
#> [1] 20
#>
#> $outlier_index$upper_out_index$disp
#> numeric(0)
#>
#> $outlier_index$upper_out_index$wt
#> [1] 15 16 17
#>
#> $outlier_index$upper_out_index$qsec
#> [1] 9
#>
#>
#> $outlier_index$lower_out_index
#> $outlier_index$lower_out_index$mpg
#> numeric(0)
#>
#> $outlier_index$lower_out_index$disp
#> numeric(0)
#>
#> $outlier_index$lower_out_index$wt
#> numeric(0)
#>
#> $outlier_index$lower_out_index$qsec
#> numeric(0)
#>
#>
#>