The presence of outliers in a classification or regression dataset can result in a poor fit and lower predictive modeling performance.

Identifying and **removing outliers** is challenging with simple statistical methods for most machine learning datasets given the large number of input variables. Instead, automatic outlier detection methods can be used in the modeling pipeline and compared, just like other data preparation transforms that may be applied to the dataset.

In this tutorial, you will discover how to use automatic outlier detection and removal to improve machine learning predictive modeling performance.

After completing this tutorial, you will know:

- Automatic outlier detection models provide an alternative to statistical techniques with a larger number of input variables with complex and unknown inter-relationships.
- How to correctly apply automatic outlier detection and removal to the training dataset only to avoid data leakage.
- How to evaluate and compare predictive modeling pipelines with outliers removed from the training dataset.

Let’s get started.

## Tutorial Overview

This tutorial is divided into three parts; they are:

- Outlier Detection and Removal
- Dataset and Performance Baseline
  - House Price Regression Dataset
  - Baseline Model Performance
- Automatic Outlier Detection
  - Isolation Forest
  - Minimum Covariance Determinant
  - Local Outlier Factor
  - One-Class SVM

## Outlier Detection and Removal

Outliers are observations in a dataset that don’t fit in some way.

Perhaps the most common or familiar type of outlier is an observation that is far from the rest of the observations, or from the center of mass of the observations.

This is easy to understand when we have one or two variables and we can visualize the data as a histogram or scatter plot, although it becomes very challenging when we have many input variables defining a high-dimensional input feature space.

In this case, simple statistical methods for identifying outliers can break down, such as methods that use standard deviations or the interquartile range.
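To make the simple statistical approach concrete, the sketch below flags outliers in a single variable using the interquartile range (IQR) rule. The data is synthetic and the 1.5 multiplier is the conventional choice, not a requirement.

```python
import numpy as np

# Synthetic data: 200 samples from a normal distribution plus two planted extremes.
rng = np.random.RandomState(1)
values = np.concatenate([rng.normal(50, 5, 200), [120.0, -30.0]])

# IQR rule: flag anything beyond 1.5 * IQR outside the quartiles.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print('Identified %d outliers' % len(outliers))
```

This works per-variable, which is exactly why it breaks down in high dimensions: a point can be unremarkable on every axis individually yet still be an outlier in the joint feature space.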

It can be important to identify and remove outliers from data when training machine learning algorithms for predictive modeling.

Outliers can skew statistical measures and data distributions, providing a misleading representation of the underlying data and relationships. Removing outliers from training data prior to modeling can result in a better fit of the data and, in turn, more skillful predictions.

Thankfully, there are a variety of automatic model-based methods for identifying outliers in input data. Importantly, each method approaches the definition of an outlier in a slightly different way, providing alternate approaches to preparing a training dataset that can be evaluated and compared, just like any other data preparation step in a modeling pipeline.

Before we dive into automatic outlier detection methods, let’s first select a standard machine learning dataset that we can use as the basis for our investigation.


## Dataset and Performance Baseline

In this section, we will first select a standard machine learning dataset and establish a baseline in performance on this dataset.

This will provide the context for exploring the outlier identification and removal method of data preparation in the next section.

### House Price Regression Dataset

We will use the house price regression dataset.

This dataset has 13 input variables that describe the properties of the house and suburb and requires the prediction of the median value of houses in the suburb in thousands of dollars.

No need to download the dataset as we will download it automatically as part of our worked examples.

Open the dataset and review the raw data. The first few rows of data are listed below.

We can see that it is a regression predictive modeling problem with numerical input variables, each of which has different scales.

```
0.00632,18.00,2.310,0,0.5380,6.5750,65.20,4.0900,1,296.0,15.30,396.90,4.98,24.00
0.02731,0.00,7.070,0,0.4690,6.4210,78.90,4.9671,2,242.0,17.80,396.90,9.14,21.60
0.02729,0.00,7.070,0,0.4690,7.1850,61.10,4.9671,2,242.0,17.80,392.83,4.03,34.70
0.03237,0.00,2.180,0,0.4580,6.9980,45.80,6.0622,3,222.0,18.70,394.63,2.94,33.40
0.06905,0.00,2.180,0,0.4580,7.1470,54.20,6.0622,3,222.0,18.70,396.90,5.33,36.20
...
```

The dataset has many numerical input variables that have unknown and complex relationships. We don’t know that outliers exist in this dataset, although we may guess that some outliers may be present.

The example below loads the dataset and splits it into the input and output columns, splits it into train and test datasets, then summarizes the shapes of the data arrays.

```python
# load and summarize the dataset
from pandas import read_csv
from sklearn.model_selection import train_test_split
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
df = read_csv(url, header=None)
# retrieve the array
data = df.values
# split into input and output elements
X, y = data[:, :-1], data[:, -1]
# summarize the shape of the dataset
print(X.shape, y.shape)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize the shape of the train and test sets
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

Running the example, we can see that the dataset was loaded correctly and that there are 506 rows of data with 13 input variables and a single target variable.

The dataset is split into train and test sets with 339 rows used for model training and 167 for model evaluation.

```
(506, 13) (506,)
(339, 13) (167, 13) (339,) (167,)
```

Next, let’s evaluate a model on this dataset and establish a baseline in performance.

### Baseline Model Performance

It is a regression predictive modeling problem, meaning that we will be predicting a numeric value. All input variables are also numeric.

In this case, we will fit a linear regression algorithm, evaluating model performance by training the model on the training dataset, making a prediction on the test dataset, and evaluating the predictions using the mean absolute error (MAE).

The complete example of evaluating a linear regression model on the dataset is listed below.

```python
# evaluate model on the raw dataset
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
df = read_csv(url, header=None)
# retrieve the array
data = df.values
# split into input and output elements
X, y = data[:, :-1], data[:, -1]
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# fit the model
model = LinearRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)
```

Running the example fits and evaluates the model, then reports the MAE.

Your specific results may differ given the stochastic nature of the learning algorithm, the evaluation procedure, and/or differences in precision across systems. Try running the example a few times.

In this case, we can see that the model achieved a MAE of about 3.417. This provides a baseline in performance to which we can compare different outlier identification and removal procedures.

```
MAE: 3.417
```

Next, we can try removing outliers from the training dataset.

## Automatic Outlier Detection

The scikit-learn library provides a number of built-in automatic methods for identifying outliers in data.

In this section, we will review four methods and compare their performance on the house price dataset.

Each method will be defined, then fit on the training dataset. The fit model will then predict which examples in the training dataset are outliers and which are not (so-called inliers). The outliers will then be removed from the training dataset, then the model will be fit on the remaining examples and evaluated on the entire test dataset.

It would be invalid to fit the outlier detection method on the entire training dataset as this would result in data leakage. That is, the model would have access to data (or information about the data) in the test set not used to train the model. This may result in an optimistic estimate of model performance.

We could attempt to detect outliers on “*new data*” such as the test set prior to making a prediction, but then what do we do if outliers are detected?

One approach might be to return a “*None*” indicating that the model is unable to make a prediction on those outlier cases. This might be an interesting extension to explore that may be appropriate for your project.
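As a minimal sketch of that extension, the hypothetical helper below (the name `predict_or_none` and the synthetic dataset are my own, not from the tutorial) uses a detector fit on the training data to screen test rows at prediction time, returning None for rows flagged as outliers.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical helper: predict only for rows the detector considers inliers,
# returning None for rows flagged as outliers.
def predict_or_none(model, detector, X):
    flags = detector.predict(X)  # +1 for inliers, -1 for outliers
    preds = model.predict(X)
    return [p if f == 1 else None for p, f in zip(preds, flags)]

# Synthetic regression data stands in for a real dataset.
X, y = make_regression(n_samples=200, n_features=5, noise=0.5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# Fit both the detector and the model on the training data only.
detector = IsolationForest(contamination=0.1, random_state=1).fit(X_train)
model = LinearRegression().fit(X_train, y_train)

results = predict_or_none(model, detector, X_test)
print('%d of %d test rows skipped as outliers' % (sum(r is None for r in results), len(results)))
```

Downstream code would then decide how to handle the None cases, for example by deferring to a human or a fallback model.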

### Isolation Forest

Isolation Forest, or iForest for short, is a tree-based anomaly detection algorithm.

It is based on modeling the normal data in such a way as to isolate anomalies that are both few in number and different in the feature space.

… our proposed method takes advantage of two anomalies’ quantitative properties: i) they are the minority consisting of fewer instances and ii) they have attribute-values that are very different from those of normal instances.

— Isolation Forest, 2008.

The scikit-learn library provides an implementation of Isolation Forest in the IsolationForest class.

Perhaps the most important hyperparameter in the model is the "*contamination*" argument, which is used to help estimate the number of outliers in the dataset. This is a value between 0.0 and 0.5; recent versions of scikit-learn default it to 'auto', while older versions used 0.1.

```python
...
# identify outliers in the training dataset
iso = IsolationForest(contamination=0.1)
yhat = iso.fit_predict(X_train)
```

Once identified, we can remove the outliers from the training dataset.

```python
...
# select all rows that are not outliers
mask = yhat != -1
X_train, y_train = X_train[mask, :], y_train[mask]
```

Tying this together, the complete example of evaluating the linear model on the housing dataset with outliers identified and removed with isolation forest is listed below.

```python
# evaluate model performance with outliers removed using isolation forest
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import IsolationForest
from sklearn.metrics import mean_absolute_error
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
df = read_csv(url, header=None)
# retrieve the array
data = df.values
# split into input and output elements
X, y = data[:, :-1], data[:, -1]
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize the shape of the training dataset
print(X_train.shape, y_train.shape)
# identify outliers in the training dataset
iso = IsolationForest(contamination=0.1)
yhat = iso.fit_predict(X_train)
# select all rows that are not outliers
mask = yhat != -1
X_train, y_train = X_train[mask, :], y_train[mask]
# summarize the shape of the updated training dataset
print(X_train.shape, y_train.shape)
# fit the model
model = LinearRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)
```

Running the example fits and evaluates the model, then reports the MAE.

Your specific results may differ given the stochastic nature of the learning algorithm, the evaluation procedure, and/or differences in precision across systems. Try running the example a few times.

In this case, we can see that the model identified and removed 34 outliers and achieved a MAE of about 3.189, an improvement over the baseline that achieved a score of about 3.417.

```
(339, 13) (339,)
(305, 13) (305,)
MAE: 3.189
```

### Minimum Covariance Determinant

If the input variables have a Gaussian distribution, then simple statistical methods can be used to detect outliers.

For example, if the dataset has two input variables and both are Gaussian, then the feature space forms a multi-dimensional Gaussian and knowledge of this distribution can be used to identify values far from the distribution.

This approach can be generalized by defining a hypersphere (ellipsoid) that covers the normal data, and data that falls outside this shape is considered an outlier. An efficient implementation of this technique for multivariate data is known as the Minimum Covariance Determinant, or MCD for short.

The Minimum Covariance Determinant (MCD) method is a highly robust estimator of multivariate location and scatter, for which a fast algorithm is available. […] It also serves as a convenient and efficient tool for outlier detection.

— Minimum Covariance Determinant and Extensions, 2017.

The scikit-learn library provides access to this method via the EllipticEnvelope class.

It provides the “*contamination*” argument that defines the expected ratio of outliers to be observed in practice. In this case, we will set it to a value of 0.01, found with a little trial and error.

```python
...
# identify outliers in the training dataset
ee = EllipticEnvelope(contamination=0.01)
yhat = ee.fit_predict(X_train)
```

Once identified, the outliers can be removed from the training dataset as we did in the prior example.

Tying this together, the complete example of identifying and removing outliers from the housing dataset using the elliptical envelope (minimum covariant determinant) method is listed below.

```python
# evaluate model performance with outliers removed using elliptical envelope
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.covariance import EllipticEnvelope
from sklearn.metrics import mean_absolute_error
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
df = read_csv(url, header=None)
# retrieve the array
data = df.values
# split into input and output elements
X, y = data[:, :-1], data[:, -1]
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize the shape of the training dataset
print(X_train.shape, y_train.shape)
# identify outliers in the training dataset
ee = EllipticEnvelope(contamination=0.01)
yhat = ee.fit_predict(X_train)
# select all rows that are not outliers
mask = yhat != -1
X_train, y_train = X_train[mask, :], y_train[mask]
# summarize the shape of the updated training dataset
print(X_train.shape, y_train.shape)
# fit the model
model = LinearRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)
```

Running the example fits and evaluates the model, then reports the MAE.

Your specific results may differ given the stochastic nature of the learning algorithm, the evaluation procedure, and/or differences in precision across systems. Try running the example a few times.

In this case, we can see that the elliptical envelope method identified and removed only 4 outliers, resulting in a drop in MAE from 3.417 with the baseline to 3.388.

```
(339, 13) (339,)
(335, 13) (335,)
MAE: 3.388
```

### Local Outlier Factor

A simple approach to identifying outliers is to locate those examples that are far from the other examples in the feature space.

This can work well for feature spaces with low dimensionality (few features), although it can become less reliable as the number of features is increased, referred to as the curse of dimensionality.

The local outlier factor, or LOF for short, is a technique that attempts to harness the idea of nearest neighbors for outlier detection. Each example is assigned a score of how isolated it is, or how likely it is to be an outlier, based on the size of its local neighborhood. Those examples with the largest score are more likely to be outliers.

We introduce a local outlier factor (LOF) for each object in the dataset, indicating its degree of outlier-ness.

— LOF: Identifying Density-based Local Outliers, 2000.

The scikit-learn library provides an implementation of this approach in the LocalOutlierFactor class.

The model provides the "*contamination*" argument, which specifies the expected percentage of outliers in the dataset and defaults to 0.1 (in recent versions of scikit-learn, the default is 'auto').

```python
...
# identify outliers in the training dataset
lof = LocalOutlierFactor()
yhat = lof.fit_predict(X_train)
```

Tying this together, the complete example of identifying and removing outliers from the housing dataset using the local outlier factor method is listed below.

```python
# evaluate model performance with outliers removed using local outlier factor
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import mean_absolute_error
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
df = read_csv(url, header=None)
# retrieve the array
data = df.values
# split into input and output elements
X, y = data[:, :-1], data[:, -1]
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize the shape of the training dataset
print(X_train.shape, y_train.shape)
# identify outliers in the training dataset
lof = LocalOutlierFactor()
yhat = lof.fit_predict(X_train)
# select all rows that are not outliers
mask = yhat != -1
X_train, y_train = X_train[mask, :], y_train[mask]
# summarize the shape of the updated training dataset
print(X_train.shape, y_train.shape)
# fit the model
model = LinearRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)
```

Running the example fits and evaluates the model, then reports the MAE.

In this case, we can see that the local outlier factor method identified and removed 34 outliers, the same number as isolation forest, resulting in a drop in MAE from 3.417 with the baseline to 3.356. Better, but not as good as isolation forest, suggesting a different set of outliers were identified and removed.

```
(339, 13) (339,)
(305, 13) (305,)
MAE: 3.356
```

### One-Class SVM

The support vector machine, or SVM, algorithm developed initially for binary classification can be used for one-class classification.

When modeling one class, the algorithm captures the density of the majority class and classifies examples on the extremes of the density function as outliers. This modification of SVM is referred to as One-Class SVM.

… an algorithm that computes a binary function that is supposed to capture regions in input space where the probability density lives (its support), that is, a function such that most of the data will live in the region where the function is nonzero.

— Estimating the Support of a High-Dimensional Distribution, 2001.

Although One-Class SVM, like the SVM it is derived from, is a classification algorithm, it can be used to discover outliers in input data for both regression and classification datasets.

The scikit-learn library provides an implementation of one-class SVM in the OneClassSVM class.

The class provides the "*nu*" argument that specifies the approximate ratio of outliers in the dataset, which defaults to 0.5. In this case, we will set it to 0.01, found with a little trial and error.

```python
...
# identify outliers in the training dataset
ee = OneClassSVM(nu=0.01)
yhat = ee.fit_predict(X_train)
```

Tying this together, the complete example of identifying and removing outliers from the housing dataset using the one class SVM method is listed below.

```python
# evaluate model performance with outliers removed using one class SVM
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.svm import OneClassSVM
from sklearn.metrics import mean_absolute_error
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
df = read_csv(url, header=None)
# retrieve the array
data = df.values
# split into input and output elements
X, y = data[:, :-1], data[:, -1]
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize the shape of the training dataset
print(X_train.shape, y_train.shape)
# identify outliers in the training dataset
ee = OneClassSVM(nu=0.01)
yhat = ee.fit_predict(X_train)
# select all rows that are not outliers
mask = yhat != -1
X_train, y_train = X_train[mask, :], y_train[mask]
# summarize the shape of the updated training dataset
print(X_train.shape, y_train.shape)
# fit the model
model = LinearRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)
```

Running the example fits and evaluates the model, then reports the MAE.

In this case, we can see that only three outliers were identified and removed and the model achieved a MAE of about 3.431, which is not better than the baseline model that achieved 3.417. Perhaps better performance can be achieved with more tuning.

```
(339, 13) (339,)
(336, 13) (336,)
MAE: 3.431
```
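Because each of the four methods follows the same fit-detect-remove-evaluate recipe, they can be compared in a single loop. The sketch below does this on a synthetic regression dataset (standing in for the housing data so it runs without a download); the MAE values it prints will therefore differ from those reported above.

```python
# compare the four outlier detection methods with the same evaluation recipe
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import IsolationForest
from sklearn.covariance import EllipticEnvelope
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

# synthetic stand-in for the housing dataset
X, y = make_regression(n_samples=500, n_features=13, noise=10.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

methods = {
    'IsolationForest': IsolationForest(contamination=0.1),
    'EllipticEnvelope': EllipticEnvelope(contamination=0.01),
    'LocalOutlierFactor': LocalOutlierFactor(),
    'OneClassSVM': OneClassSVM(nu=0.01),
}
results = {}
for name, method in methods.items():
    # detect outliers on the training data only, then drop them
    yhat = method.fit_predict(X_train)
    mask = yhat != -1
    model = LinearRegression().fit(X_train[mask, :], y_train[mask])
    results[name] = mean_absolute_error(y_test, model.predict(X_test))
    print('%s: removed %d rows, MAE: %.3f' % (name, int((~mask).sum()), results[name]))
```

Swapping in the housing data from the worked examples above reproduces the per-method comparison discussed in this section.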

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.


## Summary

In this tutorial, you discovered how to use automatic outlier detection and removal to improve machine learning predictive modeling performance.

Specifically, you learned:

- Automatic outlier detection models provide an alternative to statistical techniques with a larger number of input variables with complex and unknown inter-relationships.
- How to correctly apply automatic outlier detection and removal to the training dataset only to avoid data leakage.
- How to evaluate and compare predictive modeling pipelines with outliers removed from the training dataset.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.
