Apply what you have learned about using pipelines for efficient pre-processing and model training on a regression problem.
In this exercise, we want to model house sale prices in King County in the state of Washington, USA.
set.seed(124)

library(mlr3verse)
library(mlr3tuningspaces)

data("kc_housing", package = "mlr3data")
We do some simple feature pre-processing first:
# Transform time to numeric variable:
library(anytime)
dates = anytime(kc_housing$date)
kc_housing$date = as.numeric(difftime(dates, min(dates), units = "days"))

# Scale prices:
kc_housing$price = kc_housing$price / 1000

# For this task, delete columns containing NAs:
yr_renovated = kc_housing$yr_renovated
sqft_basement = kc_housing$sqft_basement
kc_housing[, c(13, 15)] = NULL

# Create factor columns:
kc_housing[, c(8, 14)] = lapply(c(8, 14), function(x) {as.factor(kc_housing[, x])})

# Get an overview:
str(kc_housing)
'data.frame':	21613 obs. of  18 variables:
 $ date         : num  164 221 299 221 292 ...
 $ price        : num  222 538 180 604 510 ...
 $ bedrooms     : int  3 3 2 4 3 4 3 3 3 3 ...
 $ bathrooms    : num  1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
 $ sqft_living  : int  1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
 $ sqft_lot     : int  5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
 $ floors       : num  1 2 1 1 1 1 2 1 1 2 ...
 $ waterfront   : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
 $ view         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ condition    : int  3 3 3 5 3 3 3 3 3 3 ...
 $ grade        : int  7 7 6 7 8 11 7 7 7 7 ...
 $ sqft_above   : int  1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
 $ yr_built     : int  1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
 $ zipcode      : Factor w/ 70 levels "98001","98002",..: 67 56 17 59 38 30 3 69 61 24 ...
 $ lat          : num  47.5 47.7 47.7 47.5 47.6 ...
 $ long         : num  -122 -122 -122 -122 -122 ...
 $ sqft_living15: int  1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
 $ sqft_lot15   : int  5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
 - attr(*, "index")= int(0)
Before we train a model, let’s reserve some data for evaluating our model later on:
task = as_task_regr(kc_housing, target = "price")

split = partition(task, ratio = 0.6)
tasktrain = task$clone()
tasktrain$filter(split$train)
tasktest = task$clone()
tasktest$filter(split$test)
In the King County data, there are two categorical features encoded as factor: waterfront and zipcode.
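A quick way to confirm this is to look at the task's feature types; the data.table-style subsetting below is just one way to filter the feature_types field:

task$feature_types[type == "factor"]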
Obviously, waterfront is a low-cardinality feature suitable for one-hot encoding, while zipcode is a very high-cardinality feature. Therefore, it makes sense to create a pipeline that first pre-processes each factor variable with either impact or one-hot encoding, depending on the feature's cardinality.
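One possible way to express this with mlr3pipelines is sketched below (not the only reasonable setup; the cardinality threshold of 10 is an arbitrary choice for illustration): impact-encode all factors above the threshold, then one-hot encode whatever factors remain.

# Impact-encode high-cardinality factors (e.g. zipcode), then one-hot encode
# the remaining low-cardinality factors (e.g. waterfront):
graph_encode =
  po("encodeimpact", affect_columns = selector_cardinality_greater_than(10)) %>>%
  po("encode", method = "one-hot")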
Filter algorithms select features by assigning a numeric score to each feature (e.g. the correlation between the feature and the target variable), ranking the features by these scores, and selecting a feature subset based on the ranking. Features with lower scores are then omitted in subsequent modeling steps. All filters are implemented in the package mlr3filters. A very simple filter approach could look like this:
- Calculate the correlation coefficient between each feature and a numeric target variable
- Select the 10 features with the highest correlation for further modeling steps.
A different strategy could entail selecting only features above a certain threshold of correlation with the outcome. For a full list of all implemented filter methods, take a look at https://mlr3filters.mlr-org.com.
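As a sketch of how such a filter can be plugged into a pipeline (the numbers below are purely illustrative):

# Keep the 10 features with the highest Pearson correlation with the target:
po("filter", filter = flt("correlation", method = "pearson"), filter.nfeat = 10)

# Alternatively, keep all features whose filter score exceeds a threshold:
po("filter", filter = flt("correlation", method = "pearson"), filter.cutoff = 0.5)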
Exercise 1: Create a complex pipeline
Create a pipeline with the following sequence of elements:
- Each factor variable gets pre-processed with either one-hot or impact encoding, depending on the cardinality of the feature.
- A filter selector is applied to the features, sorting them by their Pearson correlation coefficient and selecting the 3 features with the highest correlation.
- A random forest (regr.ranger) is trained.
The pipeline should be tuned within an AutoTuner with random search, two-fold CV, and MSE as the performance measure, using a search space from mlr3tuningspaces but without tuning the hyperparameter replace. Train the AutoTuner on the training data and evaluate the performance on the holdout test data.
Hint 1: Check out the help page of lts from mlr3tuningspaces.
Hint 2: Since we want to work with the search space right away, it's recommended to insert the Learner directly. Ensure that the learner uses the default value for the replace hyperparameter.
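Taken together, the hints suggest something along the following lines (a sketch, assuming the default regr.ranger space from mlr3tuningspaces is used; the cardinality threshold and the budget of 10 evaluations are arbitrary choices for illustration):

# Attach the default ranger search space directly to the learner (Hints 1 and 2),
# then drop 'replace' so it keeps its default value instead of being tuned:
learner = lts(lrn("regr.ranger"))
learner$param_set$values$replace = NULL

# Combine encoding, filtering and the learner into one graph:
graph =
  po("encodeimpact", affect_columns = selector_cardinality_greater_than(10)) %>>%
  po("encode", method = "one-hot") %>>%
  po("filter", filter = flt("correlation"), filter.nfeat = 3) %>>%
  learner

# Wrap the graph in an AutoTuner with random search, two-fold CV and MSE:
at = auto_tuner(
  tuner = tnr("random_search"),
  learner = as_learner(graph),
  resampling = rsmp("cv", folds = 2),
  measure = msr("regr.mse"),
  term_evals = 10  # arbitrary budget for illustration
)

at$train(tasktrain)
at$predict(tasktest)$score(msr("regr.mse"))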
Exercise 2: Information gain
An alternative filter method is information gain (https://mlr3filters.mlr-org.com/reference/mlr_filters_information_gain.html). Recreate the pipeline from Exercise 1, but use information gain as the filter. Again, select the three features with the highest information gain. Train the AutoTuner on the training data and evaluate the performance on the holdout test data.
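Relative to Exercise 1, only the filter PipeOp changes; a sketch (note that flt("information_gain") requires the FSelectorRcpp package to be installed):

po("filter", filter = flt("information_gain"), filter.nfeat = 3)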
Exercise 3: Pearson correlation vs. Information gain
We receive the following performance scores for the two filter methods:
score_rf_cor
score_rf_info
As you can see, the Pearson correlation filter seems to select features that result in a better model. To investigate why that may have happened, inspect the trained AutoTuners: which features were selected by each filter? Given the selected features, discuss which filter method may be more helpful for deciding which features to use for model training, and why.
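One way to start this inspection is sketched below; it assumes the two trained AutoTuners were stored as at_cor and at_info (hypothetical names) and that the filter PipeOps kept their default ids. A trained GraphLearner stores one state per PipeOp, named by the PipeOp id, and PipeOpFilter records the computed filter scores in its state:

# Filter scores per feature; the top-ranked features are the ones that were kept:
sort(at_cor$learner$model$correlation$scores, decreasing = TRUE)
sort(at_info$learner$model$information_gain$scores, decreasing = TRUE)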
We learned about more complex pipelines, including pre-processing methods such as variable encoding and feature filtering.