You will learn how to estimate model performance with mlr3 using resampling techniques such as 5-fold cross-validation. Additionally, you will compare a k-NN model against a logistic regression model.
We work with the German credit data. You can either manually create the corresponding mlr3 task as we did before, or use a pre-defined task that is already included in the mlr3 package (look at the output of as.data.table(mlr_tasks) to see which other pre-defined tasks the mlr3 package ships with to play around with).
library(mlr3verse)
Loading required package: mlr3
task = tsk("german_credit")
task
<TaskClassif:german_credit> (1000 x 21): German Credit
* Target: credit_risk
* Properties: twoclass
* Features (20):
  - fct (14): credit_history, employment_duration, foreign_worker, housing, job, other_debtors, other_installment_plans, people_liable, personal_status_sex, property, purpose, savings, status, telephone
  - int (3): age, amount, duration
  - ord (3): installment_rate, number_credits, present_residence
task$positive # (check the positive class)
We first create two mlr3 learners, a logistic regression learner and a k-NN learner. We then compare their performance via resampling.
Create the learners
Create a logistic regression learner (store it as an R object called log_reg) and a k-NN learner (store it as an R object called knn).
Show Hint 1:
Check as.data.table(mlr_learners) to find the appropriate learner.
Show Hint 2:
Make sure to have the kknn package installed.
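A possible solution (a sketch; it assumes the mlr3verse and kknn packages are installed and uses the standard mlr3 learner keys):

```r
library(mlr3verse)

# Logistic regression learner
log_reg = lrn("classif.log_reg")

# k-NN learner (requires the kknn package)
knn = lrn("classif.kknn")
```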
Set up a resampling instance
Use mlr3 to set up a resampling instance and store it as an R object called cv5. Here, we aim for 5-fold cross-validation. A table of the resampling techniques implemented in mlr3 can be shown by looking at as.data.table(mlr_resamplings).
Show Hint 1:
Look at the table returned by as.data.table(mlr_resamplings) and use the rsmp function to set up a 5-fold cross-validation instance. Store the result of the rsmp function in an R object called cv5.
Show Hint 2:
rsmp("cv") by default sets up a 10-fold cross-validation instance. The number of folds can be set using an additional argument (see the params column from as.data.table(mlr_resamplings)).
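One way to set this up (the folds argument corresponds to the params column mentioned above):

```r
# 5-fold cross-validation; rsmp("cv") alone would default to 10 folds,
# so we override the number of folds explicitly
cv5 = rsmp("cv", folds = 5)
```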
Run the resampling
After having created a resampling instance, use it to apply the chosen resampling technique to both previously created learners.
Show Hint 1:
You need to supply the task, the learner and the previously created resampling instance as arguments to the resample function. See ?resample for further details and examples.
Show Hint 2:
The key ingredients for resample() are a task (created by tsk()), a learner (created by lrn()) and a resampling strategy (created by rsmp()), e.g.,
resample(task = task, learner = log_reg, resampling = cv5)
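For both learners, this could look as follows (the set.seed() call is our addition, used only to make the fold assignment reproducible):

```r
set.seed(42)  # assumption: fixed seed so both runs use comparable folds
res_log_reg = resample(task = task, learner = log_reg, resampling = cv5)
res_knn     = resample(task = task, learner = knn, resampling = cv5)
```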
Evaluation
Compute the cross-validated classification accuracy of both models. Which learner performed better?
Show Hint 1:
Use msr("classif.acc") and the aggregate() method of the resample result object.
Show Hint 2:
Call res_knn$aggregate(msr(...)) to obtain the classification accuracy averaged across all folds.
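Assuming the resample results were stored as res_knn (as in the hint above) and res_log_reg (a name we chose for illustration), the cross-validated accuracy could be computed like this:

```r
# Classification accuracy averaged over the 5 folds, one value per learner
res_log_reg$aggregate(msr("classif.acc"))
res_knn$aggregate(msr("classif.acc"))
```

The learner with the higher aggregated accuracy performed better on this task under 5-fold cross-validation.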
We can now apply different resampling methods to estimate the performance of different learners and compare them fairly. Compared to a simple train/test split, resampling yields a performance estimate with lower variance.