Apply what you have learned about using pipelines for efficient pre-processing and model training on a regression problem.
In this exercise, we want to model house sale prices in King County in the state of Washington, USA.
set.seed(124)

library(mlr3verse)
library(mlr3tuningspaces)

data("kc_housing", package = "mlr3data")
We do some simple feature pre-processing first:
# Transform time to numeric variable:
library(anytime)
dates = anytime(kc_housing$date)
kc_housing$date = as.numeric(difftime(dates, min(dates), units = "days"))

# Scale prices:
kc_housing$price = kc_housing$price / 1000

# For this task, delete columns containing NAs:
yr_renovated = kc_housing$yr_renovated
sqft_basement = kc_housing$sqft_basement
kc_housing[, c(13, 15)] = NULL

# Create factor columns:
kc_housing[, c(8, 14)] = lapply(c(8, 14), function(x) {as.factor(kc_housing[, x])})

# Get an overview:
str(kc_housing)
'data.frame':	21613 obs. of  18 variables:
 $ date         : num  164 221 299 221 292 ...
 $ price        : num  222 538 180 604 510 ...
 $ bedrooms     : int  3 3 2 4 3 4 3 3 3 3 ...
 $ bathrooms    : num  1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
 $ sqft_living  : int  1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
 $ sqft_lot     : int  5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
 $ floors       : num  1 2 1 1 1 1 2 1 1 2 ...
 $ waterfront   : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
 $ view         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ condition    : int  3 3 3 5 3 3 3 3 3 3 ...
 $ grade        : int  7 7 6 7 8 11 7 7 7 7 ...
 $ sqft_above   : int  1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
 $ yr_built     : int  1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
 $ zipcode      : Factor w/ 70 levels "98001","98002",..: 67 56 17 59 38 30 3 69 61 24 ...
 $ lat          : num  47.5 47.7 47.7 47.5 47.6 ...
 $ long         : num  -122 -122 -122 -122 -122 ...
 $ sqft_living15: int  1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
 $ sqft_lot15   : int  5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
 - attr(*, "index")= int(0)
Before we train a model, let’s reserve some data for evaluating our model later on:
task = as_task_regr(kc_housing, target = "price")

split = partition(task, ratio = 0.6)
tasktrain = task$clone()
tasktrain$filter(split$train)
tasktest = task$clone()
tasktest$filter(split$test)
In the King County data, there are two categorical features encoded as factor: waterfront and zipcode.
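A quick way to confirm this is to look at the task's feature types; the data.table-style subsetting below is just one way to filter the feature_types field:

task$feature_types[type == "factor"]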
Obviously, waterfront is a low-cardinality feature suitable for one-hot encoding, while zipcode is a very high-cardinality feature. Therefore, it makes sense to create a pipeline that first pre-processes each factor variable with either impact or one-hot encoding, depending on the feature's cardinality.
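One possible way to express this with mlr3pipelines is sketched below (not the only reasonable setup; the cardinality threshold of 10 is an arbitrary choice for illustration): impact-encode all factors above the threshold, then one-hot encode whatever factors remain.

# Impact-encode high-cardinality factors (e.g. zipcode), then one-hot encode
# the remaining low-cardinality factors (e.g. waterfront):
graph_encode =
  po("encodeimpact", affect_columns = selector_cardinality_greater_than(10)) %>>%
  po("encode", method = "one-hot")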
Filter algorithms select features by assigning a numeric score to each feature (e.g. the correlation between the feature and the target variable), ranking the features by these scores, and selecting a feature subset based on the ranking. Features with lower scores are then omitted in subsequent modeling steps. All filters are implemented in the package mlr3filters. A very simple filter approach could look like this:
- Calculate the correlation coefficient between each feature and a numeric target variable
- Select the 10 features with the highest correlation for further modeling steps.
A different strategy could entail selecting only features above a certain threshold of correlation with the outcome. For a full list of all implemented filter methods, take a look at https://mlr3filters.mlr-org.com.
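As a sketch of how such a filter can be plugged into a pipeline (the numbers below are purely illustrative):

# Keep the 10 features with the highest Pearson correlation with the target:
po("filter", filter = flt("correlation", method = "pearson"), filter.nfeat = 10)

# Alternatively, keep all features whose filter score exceeds a threshold:
po("filter", filter = flt("correlation", method = "pearson"), filter.cutoff = 0.5)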
Exercise 1: Create a complex pipeline
Create a pipeline with the following sequence of elements:
- Each factor variable gets pre-processed with either one-hot or impact encoding, depending on the cardinality of the feature.
- A filter selector is applied to the features, sorting them by their Pearson correlation coefficient and selecting the 3 features with the highest correlation.
- A random forest (regr.ranger) is trained.
The pipeline should be tuned within an AutoTuner with random search, two-fold CV, and MSE as the performance measure, using a search space from mlr3tuningspaces but without tuning the hyperparameter replace. Train the AutoTuner on the training data and evaluate the performance on the holdout test data.
Hint 1: Check out the help page of lts from mlr3tuningspaces.
Hint 2: Since we want to work with the search space right away, it's recommended to insert the Learner directly. Ensure that the learner uses the default value for the replace hyperparameter.
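Taken together, the hints suggest something along the following lines (a sketch, assuming the default regr.ranger space from mlr3tuningspaces is used; the cardinality threshold and the budget of 10 evaluations are arbitrary choices for illustration):

# Attach the default ranger search space directly to the learner (Hints 1 and 2),
# then drop 'replace' so it keeps its default value instead of being tuned:
learner = lts(lrn("regr.ranger"))
learner$param_set$values$replace = NULL

# Combine encoding, filtering and the learner into one graph:
graph =
  po("encodeimpact", affect_columns = selector_cardinality_greater_than(10)) %>>%
  po("encode", method = "one-hot") %>>%
  po("filter", filter = flt("correlation"), filter.nfeat = 3) %>>%
  learner

# Wrap the graph in an AutoTuner with random search, two-fold CV and MSE:
at = auto_tuner(
  tuner = tnr("random_search"),
  learner = as_learner(graph),
  resampling = rsmp("cv", folds = 2),
  measure = msr("regr.mse"),
  term_evals = 10  # arbitrary budget for illustration
)

at$train(tasktrain)
at$predict(tasktest)$score(msr("regr.mse"))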
Exercise 2: Information gain
An alternative filter method is information gain (https://mlr3filters.mlr-org.com/reference/mlr_filters_information_gain.html). Recreate the pipeline from Exercise 1, but use information gain as the filter. Again, select the three features with the highest information gain. Train the AutoTuner on the training data and evaluate the performance on the holdout test data.
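Relative to Exercise 1, only the filter PipeOp changes; a sketch (note that flt("information_gain") requires the FSelectorRcpp package to be installed):

po("filter", filter = flt("information_gain"), filter.nfeat = 3)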
Exercise 3: Pearson correlation vs. Information gain
We receive the following performance scores for the two filter methods:
score_rf_cor
score_rf_info
As you can see, the Pearson correlation filter seems to select features that result in a better model. To investigate why that may have happened, inspect the trained AutoTuners: which features were selected by each filter? Given the selected features, discuss which filter method may be more helpful for deciding which features to use for model training, and why.
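One way to start this inspection is sketched below; it assumes the two trained AutoTuners were stored as at_cor and at_info (hypothetical names) and that the filter PipeOps kept their default ids. A trained GraphLearner stores one state per PipeOp, named by the PipeOp id, and PipeOpFilter records the computed filter scores in its state:

# Filter scores per feature; the top-ranked features are the ones that were kept:
sort(at_cor$learner$model$correlation$scores, decreasing = TRUE)
sort(at_info$learner$model$information_gain$scores, decreasing = TRUE)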
We learned about more complex pipelines, including pre-processing methods such as variable encoding and feature filtering.