You will learn how to estimate model performance with mlr3 using resampling techniques such as 5-fold cross-validation. Additionally, you will compare a k-NN model against a logistic regression model.
We work with the German credit data. You can either manually create the corresponding mlr3 task as we did before, or use a pre-defined task that is already included in the mlr3 package (look at the output of as.data.table(mlr_tasks) to see which other pre-defined tasks the mlr3 package ships with to play around with).
library(mlr3verse)
Loading required package: mlr3
task = tsk("german_credit")
task
<TaskClassif:german_credit> (1000 x 21): German Credit
* Target: credit_risk
* Properties: twoclass
* Features (20):
  - fct (14): credit_history, employment_duration, foreign_worker, housing, job, other_debtors, other_installment_plans, people_liable, personal_status_sex, property, purpose, savings, status, telephone
  - int (3): age, amount, duration
  - ord (3): installment_rate, number_credits, present_residence
task$positive # (check the positive class)
We first create two mlr3 learners, a logistic regression learner and a k-NN learner. We then compare their performance via resampling.
Create the learners
Create a logistic regression learner (store it as an R object called log_reg) and a k-NN learner (store it as an R object called knn).
Show Hint 1:
Check as.data.table(mlr_learners) to find the appropriate learner.
Show Hint 2:
Make sure to have the kknn package installed.
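A possible solution (a sketch; it assumes the mlr3verse and kknn packages are installed and uses the standard mlr3 learner keys):

```r
library(mlr3verse)

# Logistic regression learner
log_reg = lrn("classif.log_reg")

# k-NN learner (requires the kknn package)
knn = lrn("classif.kknn")
```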
Set up a resampling instance
Use mlr3 to set up a resampling instance and store it as an R object called cv5. Here, we aim for 5-fold cross-validation. A table of the resampling techniques implemented in mlr3 can be shown by looking at as.data.table(mlr_resamplings).
Show Hint 1:
Look at the table returned by as.data.table(mlr_resamplings) and use the rsmp function to set up a 5-fold cross-validation instance. Store the result of the rsmp function in an R object called cv5.
Show Hint 2:
rsmp("cv") by default sets up a 10-fold cross-validation instance. The number of folds can be set using an additional argument (see the params column from as.data.table(mlr_resamplings)).
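One way to set this up (the folds argument corresponds to the params column mentioned above):

```r
# 5-fold cross-validation; rsmp("cv") alone would default to 10 folds,
# so we override the number of folds explicitly
cv5 = rsmp("cv", folds = 5)
```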
Run the resampling
After having created a resampling instance, use it to apply the chosen resampling technique to both previously created learners.
Show Hint 1:
You need to supply the task, the learner and the previously created resampling instance as arguments to the resample function. See ?resample for further details and examples.
Show Hint 2:
The key ingredients for resample() are a task (created by tsk()), a learner (created by lrn()) and a resampling strategy (created by rsmp()), e.g.,
resample(task = task, learner = log_reg, resampling = cv5)
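For both learners, this could look as follows (the set.seed() call is our addition, used only to make the fold assignment reproducible):

```r
set.seed(42)  # assumption: fixed seed so both runs use comparable folds
res_log_reg = resample(task = task, learner = log_reg, resampling = cv5)
res_knn     = resample(task = task, learner = knn, resampling = cv5)
```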
Evaluation
Compute the cross-validated classification accuracy of both models. Which learner performed better?
Show Hint 1:
Use msr("classif.acc") and the aggregate() method of the resample result object.
Show Hint 2:
Call res_knn$aggregate(msr(...)) to obtain the classification accuracy averaged across all folds.
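Assuming the resample results were stored as res_knn (as in the hint above) and res_log_reg (a name we chose for illustration), the cross-validated accuracy could be computed like this:

```r
# Classification accuracy averaged over the 5 folds, one value per learner
res_log_reg$aggregate(msr("classif.acc"))
res_knn$aggregate(msr("classif.acc"))
```

The learner with the higher aggregated accuracy performed better on this task under 5-fold cross-validation.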
We can now apply different resampling methods to estimate the performance of different learners and compare them fairly. Compared to a simple train/test split, resampling yields a performance estimate with lower variance.